PRCP-1025: Flight Price Prediction¶


Project Type - Regression¶

Name - Ari R

Contribution - Individual

Project Summary:¶

  • The Flight Price Prediction dataset represents a real-world challenge in the travel and airline industry: forecasting ticket prices based on factors such as airline, route, number of stops, duration, and journey date. Flight pricing is highly dynamic and unpredictable, making it a critical problem for both airlines and travelers. By applying machine learning, stakeholders can better understand price drivers, optimize booking strategies, and improve decision-making.

  • The objective of this project is to build a predictive machine learning model capable of estimating flight ticket prices with high accuracy. Such a model can help travelers anticipate costs, enable airlines to analyze competitive pricing, and support online platforms in offering smarter fare recommendations.

  • The dataset combines categorical and numerical features, including airline, source, destination, total stops, route, duration, and departure date. Data preprocessing included handling missing values, feature engineering (extracting day and month, converting duration into numeric format), encoding categorical variables, and addressing skewness in price distribution. Exploratory Data Analysis revealed that airline type, number of stops, and duration are the most influential factors driving ticket prices.

  • Multiple machine learning models (Linear Regression, Random Forest, XGBoost, and LightGBM) were applied to predict flight prices. Models were evaluated with the R² score, RMSE, and MAE to capture both accuracy and error magnitude. Linear Regression underperformed, while ensemble methods such as Random Forest showed strong predictive power.

  • Ensemble models, particularly LightGBM and XGBoost, delivered the strongest performance, with LightGBM emerging as the best model due to its balance of accuracy, generalization, and efficiency. The project demonstrates how machine learning can be leveraged to tackle the complexity of dynamic flight pricing. By accurately predicting ticket prices, this solution can empower travelers, airlines, and booking platforms with actionable insights, driving smarter and more cost-effective travel planning.

Problem Statement:¶

Flight ticket prices are hard to anticipate: the fare we see for a flight today may look completely different tomorrow. Travelers often remark that flight prices are unpredictable. In this project we apply machine learning to that problem, which can also help airlines understand what prices the market will bear.


Task 1: Prepare a complete data analysis report on the given data.

Task 2: Create a predictive model that helps customers predict future flight prices and plan their journeys accordingly.

Let's Begin!¶

1. Know Your Data¶

1.1. Import Libraries:¶

In [3]:
# ===== Imports =====

# ===== General =====
import numpy as np
import pandas as pd
import math
import warnings
warnings.filterwarnings('ignore')

# ===== Visualization =====
import seaborn as sns
import missingno as msno
import matplotlib.pyplot as plt
import scipy.stats as stats
from matplotlib import patheffects
from matplotlib.colors import LinearSegmentedColormap
import matplotlib.patches as mpatches
import matplotlib.colors as mcolors
import matplotlib.patheffects as path_effects
%matplotlib inline

# ===== Hypotheses testing =====
from scipy.stats import chi2_contingency
from scipy import stats
from scipy.stats import f_oneway
from scipy.stats import pearsonr

# ===== Preprocessing =====
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from category_encoders import CountEncoder
from sklearn.pipeline import Pipeline

# ===== Outlier Influence =====
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
import statsmodels.api as sm

# ===== Imbalanced handling =====
from imblearn.over_sampling import SMOTE

# ===== Model Selection =====
import time
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# ===== Evaluation Metrics =====
from sklearn.metrics import (mean_squared_error, mean_absolute_error, r2_score,
                             explained_variance_score, mean_absolute_percentage_error)
from sklearn.model_selection import cross_val_score, KFold, RandomizedSearchCV

1.2. Data Collection / Loading:¶

In [4]:
# ===== Load Data =====
df = pd.read_excel('Flight_Fare.xlsx')

# ===== Checking first five rows of dataset =====
df.head(5)
Out[4]:
Airline Date_of_Journey Source Destination Route Dep_Time Arrival_Time Duration Total_Stops Additional_Info Price
0 IndiGo 24/03/2019 Banglore New Delhi BLR → DEL 22:20 01:10 22 Mar 2h 50m non-stop No info 3897
1 Air India 1/05/2019 Kolkata Banglore CCU → IXR → BBI → BLR 05:50 13:15 7h 25m 2 stops No info 7662
2 Jet Airways 9/06/2019 Delhi Cochin DEL → LKO → BOM → COK 09:25 04:25 10 Jun 19h 2 stops No info 13882
3 IndiGo 12/05/2019 Kolkata Banglore CCU → NAG → BLR 18:05 23:30 5h 25m 1 stop No info 6218
4 IndiGo 01/03/2019 Banglore New Delhi BLR → NAG → DEL 16:50 21:35 4h 45m 1 stop No info 13302
In [5]:
# ===== Checking last five rows of dataset =====
df.tail(5)
Out[5]:
Airline Date_of_Journey Source Destination Route Dep_Time Arrival_Time Duration Total_Stops Additional_Info Price
10678 Air Asia 9/04/2019 Kolkata Banglore CCU → BLR 19:55 22:25 2h 30m non-stop No info 4107
10679 Air India 27/04/2019 Kolkata Banglore CCU → BLR 20:45 23:20 2h 35m non-stop No info 4145
10680 Jet Airways 27/04/2019 Banglore Delhi BLR → DEL 08:20 11:20 3h non-stop No info 7229
10681 Vistara 01/03/2019 Banglore New Delhi BLR → DEL 11:30 14:10 2h 40m non-stop No info 12648
10682 Air India 9/05/2019 Delhi Cochin DEL → GOI → BOM → COK 10:55 19:15 8h 20m 2 stops No info 11753

1.3. Basic Overview:¶

In [6]:
# ===== Basic Overview =====

# ===== To view the summary stats of numerical columns =====
df.describe()
Out[6]:
Price
count 10683.000000
mean 9087.064121
std 4611.359167
min 1759.000000
25% 5277.000000
50% 8372.000000
75% 12373.000000
max 79512.000000
In [7]:
# ===== To View the categorical columns =====
df.describe(include='O').T
Out[7]:
count unique top freq
Airline 10683 12 Jet Airways 3849
Date_of_Journey 10683 44 18/05/2019 504
Source 10683 5 Delhi 4537
Destination 10683 6 Cochin 4537
Route 10682 128 DEL → BOM → COK 2376
Dep_Time 10683 222 18:55 233
Arrival_Time 10683 1343 19:00 423
Duration 10683 368 2h 50m 550
Total_Stops 10682 5 1 stop 5625
Additional_Info 10683 10 No info 8345

1.4. Dataset Information:¶

1.4.1. Information¶

In [8]:
# ===== Checking the info of dataset =====
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10683 non-null  object
 1   Date_of_Journey  10683 non-null  object
 2   Source           10683 non-null  object
 3   Destination      10683 non-null  object
 4   Route            10682 non-null  object
 5   Dep_Time         10683 non-null  object
 6   Arrival_Time     10683 non-null  object
 7   Duration         10683 non-null  object
 8   Total_Stops      10682 non-null  object
 9   Additional_Info  10683 non-null  object
 10  Price            10683 non-null  int64 
dtypes: int64(1), object(10)
memory usage: 918.2+ KB
In [9]:
# ===== Checking the no. of rows and columns =====
df.shape
Out[9]:
(10683, 11)

1.4.2. Domain Analysis:¶

In [10]:
# ===== Domain Analysis =====
df.columns
Out[10]:
Index(['Airline', 'Date_of_Journey', 'Source', 'Destination', 'Route',
       'Dep_Time', 'Arrival_Time', 'Duration', 'Total_Stops',
       'Additional_Info', 'Price'],
      dtype='object')

Domain Analysis Report:¶

Feature No. Feature Name Type Description / Categories
1 Airline Categorical (object) Types of airlines (e.g., Indigo, Jet Airways, Air India, etc.)
2 Date_of_Journey Date/Time (object → datetime) Journey start date of the passenger
3 Source Categorical (object) Starting location of the journey
4 Destination Categorical (object) Destination location of the journey
5 Route Categorical (object) Route taken from source to destination
6 Dep_Time Time (object → datetime/time) Departure time of the flight
7 Arrival_Time Time (object → datetime/time) Arrival time at the destination
8 Duration String → Numeric (minutes/hours) Total travel time of the flight
9 Total_Stops Categorical (object) Number of stops in the journey
10 Additional_Info Categorical (object) Extra details (e.g., food, baggage, amenities)
11 Price Numerical (int64) Total ticket price (target variable)

Observation:

Column Dtype Notes
Airline object categorical
Date_of_Journey object should be datetime
Source object categorical
Destination object categorical
Route object categorical
Dep_Time object should be datetime/time
Arrival_Time object should be datetime/time
Duration object should be numeric (hours/minutes)
Total_Stops object categorical
Additional_Info object categorical
Price int64 numeric target variable

1. Date_of_Journey → datetime

  • Raw form = string ("24/03/2019") → ML models cannot learn from plain text dates.

  • Converting to datetime64 allows you to extract useful patterns:

  • Day of journey (weekend vs weekday)

  • Month (seasonal trends: holidays, festivals, peak travel)

  • Weekday (Monday vs Friday flights differ in price)

  • Helps the model capture temporal seasonality.

2. Dep_Time → datetime / numeric

  • Raw form = string ("22:20") → not usable by ML directly.

  • Converting to time gives features like:

  • Departure minutes from midnight (numeric)

  • Or hour of departure (morning vs evening flights → price difference)

  • Captures time-of-day effect on fares.
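The minutes-since-midnight conversion described above can be sketched in a few lines. This uses toy departure-time strings rather than the project dataframe:

```python
import pandas as pd

# Toy departure-time strings; the real notebook applies the same
# hour*60 + minute arithmetic to the Dep_Time column.
dep = pd.Series(['22:20', '05:50', '09:25'])
t = pd.to_datetime(dep, format='%H:%M')
dep_minutes = t.dt.hour * 60 + t.dt.minute
print(dep_minutes.tolist())   # [1340, 350, 565]
```

The resulting numeric feature lets a model compare morning vs evening departures directly.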

3. Arrival_Time → datetime / numeric

  • Raw form = string ("01:10") → again, string is useless.

  • Converted into minutes since midnight (or with next-day correction).

  • Tells the model whether flights arrive at odd hours vs peak hours (affects ticket cost).

  • Captures arrival convenience factor.

4. Duration → numeric (hours/minutes)

  • Raw form = string ("22h 20m") → ML models cannot parse text like "h" or "m".

  • Converted into total minutes (e.g., 1340).

  • Flight duration is one of the strongest predictors of price.

  • Converts unstructured text into a continuous variable.
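As a self-contained sketch of the duration parsing described above, here is a regex-based variant (not the notebook's exact helper, which appears in section 1.4.3):

```python
import re
import numpy as np

def duration_to_minutes(text):
    """Parse strings like '2h 50m', '19h', or '45m' into total minutes.

    Returns np.nan for anything that does not look like a duration."""
    if not isinstance(text, str):
        return np.nan
    match = re.fullmatch(r'\s*(?:(\d+)h)?\s*(?:(\d+)m)?\s*', text)
    if match is None or (match.group(1) is None and match.group(2) is None):
        return np.nan
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes

print(duration_to_minutes('2h 50m'))   # 170
print(duration_to_minutes('19h'))      # 1140
```

Both the hour and minute parts are optional, which covers entries such as "19h" and "45m" that appear in the raw Duration column.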

1.4.3. Change the dtypes and column names:¶

In [11]:
# ===== Change the dtypes and column names =====
# ===== Convert Date_of_Journey =====
df['Date_of_Journey'] = pd.to_datetime(df['Date_of_Journey'], format='%d/%m/%Y', errors='coerce')
df['Journey_day'] = df['Date_of_Journey'].dt.day
df['Journey_month'] = df['Date_of_Journey'].dt.month
df['Journey_weekday'] = df['Date_of_Journey'].dt.weekday   # Monday=0, Sunday=6

# ===== Departure Time → minutes since midnight =====
df['Dep_Time'] = pd.to_datetime(df['Dep_Time'], format='%H:%M', errors='coerce')
df['Dep_minutes'] = df['Dep_Time'].dt.hour * 60 + df['Dep_Time'].dt.minute

# ===== Arrival Time → minutes since midnight (with next-day correction) =====
def convert_arrival(x):
    try:
        # Arrival strings may carry a date suffix (e.g. "01:10 22 Mar"),
        # which marks a landing on a later day
        if " " in x:
            time_part = x.split(" ")[0]
            next_day = True
        else:
            time_part = x
            next_day = False

        t = pd.to_datetime(time_part, format='%H:%M', errors='coerce')
        if pd.isna(t):
            return np.nan
        minutes = t.hour * 60 + t.minute
        if next_day:
            minutes += 24*60   # ===== add 1440 minutes for next-day arrival =====
        return minutes
    except Exception:
        return np.nan

df['Arrival_minutes'] = df['Arrival_Time'].apply(convert_arrival)

# ===== Duration → total minutes =====
def convert_duration(x):
    try:
        h, m = 0, 0
        if 'h' in x:
            h = int(x.split('h')[0].strip())
            x = x.split('h')[1]
        if 'm' in x:
            m = int(x.split('m')[0].strip())
        return h*60 + m
    except Exception:
        return np.nan

df['Duration_minutes'] = df['Duration'].apply(convert_duration)

# ===== Final Cleanup (Drop Original Columns) =====
df = df.drop(['Date_of_Journey','Dep_Time','Arrival_Time','Duration'], axis=1)

df.head()
Out[11]:
Airline Source Destination Route Total_Stops Additional_Info Price Journey_day Journey_month Journey_weekday Dep_minutes Arrival_minutes Duration_minutes
0 IndiGo Banglore New Delhi BLR → DEL non-stop No info 3897 24 3 6 1340 1510 170
1 Air India Kolkata Banglore CCU → IXR → BBI → BLR 2 stops No info 7662 1 5 2 350 795 445
2 Jet Airways Delhi Cochin DEL → LKO → BOM → COK 2 stops No info 13882 9 6 6 565 1705 1140
3 IndiGo Kolkata Banglore CCU → NAG → BLR 1 stop No info 6218 12 5 6 1085 1410 325
4 IndiGo Banglore New Delhi BLR → NAG → DEL 1 stop No info 13302 1 3 4 1010 1295 285
In [12]:
# ===== Checking the info of dataset =====
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Airline           10683 non-null  object
 1   Source            10683 non-null  object
 2   Destination       10683 non-null  object
 3   Route             10682 non-null  object
 4   Total_Stops       10682 non-null  object
 5   Additional_Info   10683 non-null  object
 6   Price             10683 non-null  int64 
 7   Journey_day       10683 non-null  int32 
 8   Journey_month     10683 non-null  int32 
 9   Journey_weekday   10683 non-null  int32 
 10  Dep_minutes       10683 non-null  int32 
 11  Arrival_minutes   10683 non-null  int64 
 12  Duration_minutes  10683 non-null  int64 
dtypes: int32(4), int64(3), object(6)
memory usage: 918.2+ KB
  • The dataset contains 10,683 flight records with 13 columns.

  • Target variable: Price (int64) — flight ticket price.

  • Categorical features: Airline, Source, Destination, Route, Total_Stops, Additional_Info (all object).

  • Date/time features have been transformed into numeric: Journey_day, Journey_month, Journey_weekday, Dep_minutes, Arrival_minutes, Duration_minutes.

  • Clean Dataset – Only Route and Total_Stops have 1 missing value each (easy to fix). Otherwise, the dataset is complete and consistent, suitable for predictive modeling.

In [13]:
# ===== Checking the no. of rows and columns =====
df.shape
Out[13]:
(10683, 13)

2. Data wrangling / Cleaning¶

2.1. Extracting categorical and numerical columns¶

In [14]:
# ===== Extracting categorical and numerical columns =====
cat_col = [col for col in df.columns if df[col].dtype == 'object']
num_col = [col for col in df.columns if df[col].dtype != 'object']

# ===== Looking at unique values in categorical and numerical columns =====
print("Categorical Columns:\n")
for col in cat_col:
    print(f'\n{col}:\n{df[col].unique()}')

print("\nNumerical Columns:\n")
for col in num_col:
    print(f'\n{col}:\n{df[col].unique()}')
Categorical Columns:


Airline:
['IndiGo' 'Air India' 'Jet Airways' 'SpiceJet' 'Multiple carriers' 'GoAir'
 'Vistara' 'Air Asia' 'Vistara Premium economy' 'Jet Airways Business'
 'Multiple carriers Premium economy' 'Trujet']

Source:
['Banglore' 'Kolkata' 'Delhi' 'Chennai' 'Mumbai']

Destination:
['New Delhi' 'Banglore' 'Cochin' 'Kolkata' 'Delhi' 'Hyderabad']

Route:
['BLR → DEL' 'CCU → IXR → BBI → BLR' 'DEL → LKO → BOM → COK'
 'CCU → NAG → BLR' 'BLR → NAG → DEL' 'CCU → BLR' 'BLR → BOM → DEL'
 'DEL → BOM → COK' 'DEL → BLR → COK' 'MAA → CCU' 'CCU → BOM → BLR'
 'DEL → AMD → BOM → COK' 'DEL → PNQ → COK' 'DEL → CCU → BOM → COK'
 'BLR → COK → DEL' 'DEL → IDR → BOM → COK' 'DEL → LKO → COK'
 'CCU → GAU → DEL → BLR' 'DEL → NAG → BOM → COK' 'CCU → MAA → BLR'
 'DEL → HYD → COK' 'CCU → HYD → BLR' 'DEL → COK' 'CCU → DEL → BLR'
 'BLR → BOM → AMD → DEL' 'BOM → DEL → HYD' 'DEL → MAA → COK' 'BOM → HYD'
 'DEL → BHO → BOM → COK' 'DEL → JAI → BOM → COK' 'DEL → ATQ → BOM → COK'
 'DEL → JDH → BOM → COK' 'CCU → BBI → BOM → BLR' 'BLR → MAA → DEL'
 'DEL → GOI → BOM → COK' 'DEL → BDQ → BOM → COK' 'CCU → JAI → BOM → BLR'
 'CCU → BBI → BLR' 'BLR → HYD → DEL' 'DEL → TRV → COK'
 'CCU → IXR → DEL → BLR' 'DEL → IXU → BOM → COK' 'CCU → IXB → BLR'
 'BLR → BOM → JDH → DEL' 'DEL → UDR → BOM → COK' 'DEL → HYD → MAA → COK'
 'CCU → BOM → COK → BLR' 'BLR → CCU → DEL' 'CCU → BOM → GOI → BLR'
 'DEL → RPR → NAG → BOM → COK' 'DEL → HYD → BOM → COK'
 'CCU → DEL → AMD → BLR' 'CCU → PNQ → BLR' 'BLR → CCU → GAU → DEL'
 'CCU → DEL → COK → BLR' 'BLR → PNQ → DEL' 'BOM → JDH → DEL → HYD'
 'BLR → BOM → BHO → DEL' 'DEL → AMD → COK' 'BLR → LKO → DEL'
 'CCU → GAU → BLR' 'BOM → GOI → HYD' 'CCU → BOM → AMD → BLR'
 'CCU → BBI → IXR → DEL → BLR' 'DEL → DED → BOM → COK'
 'DEL → MAA → BOM → COK' 'BLR → AMD → DEL' 'BLR → VGA → DEL'
 'CCU → JAI → DEL → BLR' 'CCU → AMD → BLR' 'CCU → VNS → DEL → BLR'
 'BLR → BOM → IDR → DEL' 'BLR → BBI → DEL' 'BLR → GOI → DEL'
 'BOM → AMD → ISK → HYD' 'BOM → DED → DEL → HYD' 'DEL → IXC → BOM → COK'
 'CCU → PAT → BLR' 'BLR → CCU → BBI → DEL' 'CCU → BBI → HYD → BLR'
 'BLR → BOM → NAG → DEL' 'BLR → CCU → BBI → HYD → DEL' 'BLR → GAU → DEL'
 'BOM → BHO → DEL → HYD' 'BOM → JLR → HYD' 'BLR → HYD → VGA → DEL'
 'CCU → KNU → BLR' 'CCU → BOM → PNQ → BLR' 'DEL → BBI → COK'
 'BLR → VGA → HYD → DEL' 'BOM → JDH → JAI → DEL → HYD'
 'DEL → GWL → IDR → BOM → COK' 'CCU → RPR → HYD → BLR' 'CCU → VTZ → BLR'
 'CCU → DEL → VGA → BLR' 'BLR → BOM → IDR → GWL → DEL'
 'CCU → DEL → COK → TRV → BLR' 'BOM → COK → MAA → HYD' 'BOM → NDC → HYD'
 'BLR → BDQ → DEL' 'CCU → BOM → TRV → BLR' 'CCU → BOM → HBX → BLR'
 'BOM → BDQ → DEL → HYD' 'BOM → CCU → HYD' 'BLR → TRV → COK → DEL'
 'BLR → IDR → DEL' 'CCU → IXZ → MAA → BLR' 'CCU → GAU → IMF → DEL → BLR'
 'BOM → GOI → PNQ → HYD' 'BOM → BLR → CCU → BBI → HYD' 'BOM → MAA → HYD'
 'BLR → BOM → UDR → DEL' 'BOM → UDR → DEL → HYD' 'BLR → VGA → VTZ → DEL'
 'BLR → HBX → BOM → BHO → DEL' 'CCU → IXA → BLR' 'BOM → RPR → VTZ → HYD'
 'BLR → HBX → BOM → AMD → DEL' 'BOM → IDR → DEL → HYD' 'BOM → BLR → HYD'
 'BLR → STV → DEL' 'CCU → IXB → DEL → BLR' 'BOM → JAI → DEL → HYD'
 'BOM → VNS → DEL → HYD' 'BLR → HBX → BOM → NAG → DEL' nan
 'BLR → BOM → IXC → DEL' 'BLR → CCU → BBI → HYD → VGA → DEL'
 'BOM → BBI → HYD']

Total_Stops:
['non-stop' '2 stops' '1 stop' '3 stops' nan '4 stops']

Additional_Info:
['No info' 'In-flight meal not included' 'No check-in baggage included'
 '1 Short layover' 'No Info' '1 Long layover' 'Change airports'
 'Business class' 'Red-eye flight' '2 Long layover']

Numerical Columns:


Price:
[ 3897  7662 13882 ...  9790 12352 12648]

Journey_day:
[24  1  9 12 27 18  3 15  6 21]

Journey_month:
[3 5 6 4]

Journey_weekday:
[6 2 4 0 1 5 3]

Dep_minutes:
[1340  350  565 1085 1010  540 1135  480  535  685  585 1220  700 1270
 1035 1000  525  840 1215  960  850 1320  240 1285 1310  420  425  590
  875  635  905  855  405 1255  670  345 1140 1385  660  575 1275 1435
 1185  530  940  365  900  835  355  800  305  385 1050  500 1195  390
  845  120  580  505 1225  795  135 1015 1245  315 1190 1200  370 1170
  285  775 1095 1040  925 1380  720  885  710  690  880 1150  360 1410
  455  785  750  910  770 1105  990   40  410  780 1155   90 1020  600
 1175  930  730  970 1235 1345 1265  335  310  400  915   30  510  430
  330  865  325  620 1065  790 1330  295 1070 1280  380  955 1230 1045
  570  450  155  655 1030  550 1125  920 1370  895  860  805 1335  665
  975 1210  415 1145  475  465  610  495  695 1260 1075 1005 1100  230
  515 1160 1205 1060  280 1055  595  300 1080  175 1240 1375 1360 1290
  490 1025  445  945  555  950  705 1325 1115   25 1180 1250 1365  630
 1405  715  645  675  740  870  435   95 1120  560 1315  830  100   20
  255  825 1110  375  125  735  810  395  605  520  185 1295  995  150
  985  340  935  820  440  290  765  625  725  680 1300  180]

Arrival_minutes:
[1510  795 1705 1410 1295  685 2065 1745 1155 1380 1375  835 2000 1190
 2595 2195 1160 2240  410 1310 1935 1655  775  755  560 1395 1260 1220
 1005 1140 1090  915 2580  515 2105  855 2095 1455 1605 2160 1435 1530
 2445 2005  925  935  510  605  870  475  635 1355 1720 2125  445 1245
 2305 1350  800 1400  460 1370 2315  535 1130 1985 1535  580  985 2045
 2110 1025 1070 1390  525 1905  970 1495 2855 1200 1480 2570 1915 1305
 2535  205 2235  255 1270 1265 2025 1085 1850 1320 2355  720  930 1900
  530  480 1175 1415  680  615  765 1580  435  710 2550 1185 2625 1280
 1205  550 2630 1340  610 1500 1430  740 1225 1365 1285  845 1445 1195
 2420  425 1150  725  245 1180 1075 1045 1470 1420 1385 1490 2265 1995
 1235 1870  575  860 1975 1725  630 1165 1775 1980 1360  885  805 1315
 2085  640  780 2770  980 1115 2800 1765 1335  730  735  840 1575  770
 1035  455 1425 1330 2660 1345  745 1835 1520 1125  465  570  265 2590
  260  645 1095 2090 2845 1325 2175 2080  810 2835  815 2815 1055  500
 1105 1990  620 2530  950  690 1405  850 1465 2685  195  955 2705 2820
 1485 1715 2520 2190 1570 1460 1300 2840  520 1015 1110 2185 1230 2795
 2665 1040 1475 1215  750  890  820 2450  865 2010 1080 2165 2075 2345
 1065 2560 2015 1970  190 2440  590 2620 2410 2720  875 1885  785 1060
 2680 2220 1030 1250 2120 1450 1930 2415 2765  965 2040  995 2525 2070
 2540  240 1275  880 1100  975 1545 2430 2155 2030 2260 2055 2360 2545
  430 1210  900 2605  945  760  910 2875 2480  625  790 1240  675  285
 1660 2275  600 2635 2565  960 2135 1965]

Duration_minutes:
[ 170  445 1140  325  285  145  930 1265 1530  470  795  155  135  730
 1595  270 1355 1380 1235  310  920  175  800  910  345  355  805 1320
  330  625  315  150  375  715  665  510 1325  165  720  965 1195  195
 1520  180  975  905  390 1505  745 1640  615  630   90   85 1590  440
  810  300 1145  890  160 1330  575  600 1280 1125  740 1080  555 1050
  995  735  450 1440  535  430  870 1820  900  765  610  925  845 1215
 1390 1090  960  140  480 1015  190  840 1430 1300 1275  650  495  515
  710 1655  505 1255  290  490 1465 1415 1545 1570 1730 1515  560  550
  185  690  570 1055  305 1550 1200  780 1105 1450  295 1535  380 1120
 1165 1760  545  645  700 1375 2245 1540  835  520 1410  755 1455   80
  660  675  875  775  540  460  705 1495 1025 1795 1335  880  435 1210
 1245 1620 1470 1225  335  885  340  245  955  465 1700  260  220  530
 1425 1485 1295  485  385  950 1585 1490 1560 1385  475 1580 1395  320
  240  585  500 1045  425 2045  365  350  420  265  825 1155 1350  985
  830 1625 1690  280  940  275 1110 2295  395  750  680  455 1775 1615
 1420  770  590 1315  655 1270 1240 1800  790  525  370 1065 1305  235
 1040 1830 1285  760 1475 1150 1360  895 1260  405 1720  580 1000  980
 1005   75  415  685  860  725 1445 1695 1070 1220 1685  620  855 2115
 2135 1600 1680  865  785 2240 2170 1555 2105 1185 1675 2820  635   95
  970 2300  360 1010  850 1400 1060  695 1100  400 1855 1480 1790 1705
 1035 1365 1525 1310 1995 1815  215 1660 1825 1130 1665  915  640 1575
 2185 1610  945 1180 1345 1175 1500 1605 2280  255 1510 1095  410 1435
 1075 1405 1030 1460 1710 1630 1160  935  565 1290 2065 1115 1780 1565
 1745 1645  990  670 1735 1750 2040 1840 1845 1975  605 2120 1925 1900
 1190 2025 1810  820 1170 1890 2070 1670 2315 2525  250 2345  230    5
 1950 1915 2000 1650 1135  595 2480 1205 1910 2565  205 2230 1770 1940
 1250 2420  815 2860]

Observations from Categorical Columns

Column Unique Values (Sample) Count of Unique Values Notes
Airline IndiGo, Air India, Jet Airways, SpiceJet, Multiple carriers, GoAir, Vistara, Air Asia, … 12 Multiple airlines including economy, premium, and business classes.
Source Banglore, Kolkata, Delhi, Chennai, Mumbai 5 Major metro cities as departure points.
Destination New Delhi, Banglore, Cochin, Kolkata, Delhi, Hyderabad 6 Key arrival cities, mix of metros and tier-2.
Route BLR → DEL, CCU → IXR → BBI → BLR, DEL → LKO → BOM → COK, CCU → NAG → BLR, … 128 Complex routes with 1–4 layovers; one missing value (NaN).
Total_Stops non-stop, 1 stop, 2 stops, 3 stops, 4 stops 5 (+ NaN) Number of layovers; one missing value present.
Additional_Info No info, In-flight meal not included, No check-in baggage included, 1 Short layover, Change airports, Business class, Red-eye … 10 Mix of service details; “No info/No Info” redundancy observed.

Observations from Numerical Columns

Column Range (Min–Max) Count of Unique Values Notes
Price 1,759 – 79,512 Many (continuous) Ticket price; continuous and positively skewed (a few very high fares).
Journey_day 1 – 27 10 Day of month extracted from the journey date; only 10 distinct days appear.
Journey_month 3 – 6 4 Only months March → June are present in the dataset.
Journey_weekday 0 – 6 7 Encoded as 0=Monday … 6=Sunday. Covers all days of the week.
Dep_minutes 20 – 1435 ≈ 220 Departure time converted into minutes since midnight (00:00 → 23:59).
Arrival_minutes 190 – 2875 ≈ 300 Arrival time in minutes; values above 1440 indicate next-day arrivals.
Duration_minutes 5 – 2860 ≈ 360 Flight duration in minutes; highly variable (short = direct, long = multi-stop).
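To make the above-1440 convention for Arrival_minutes concrete, here is a tiny illustrative helper (not part of the notebook's pipeline) that splits such a value back into a day offset and a clock time:

```python
def arrival_to_clock(minutes):
    """Split minutes since the departure day's midnight into (days_later, 'HH:MM')."""
    days, rem = divmod(minutes, 24 * 60)
    return days, f"{rem // 60:02d}:{rem % 60:02d}"

print(arrival_to_clock(1510))   # (1, '01:10') -> next-day arrival
print(arrival_to_clock(795))    # (0, '13:15') -> same-day arrival
```

The two sample values correspond to the first two rows of the transformed dataset shown earlier.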

2.2. Observations from Categorical Columns and Imputation¶

In [15]:
# ===== Count the number of unique values =====
for col in cat_col:
    print(f"Column: '{col}'")
    print(f" * Unique Categories: {df[col].nunique()}")
    print(f" * Category Distribution:\n{df[col].value_counts(dropna=False)}")
    print("-" * 30)
Column: 'Airline'
 * Unique Categories: 12
 * Category Distribution:
Airline
Jet Airways                          3849
IndiGo                               2053
Air India                            1752
Multiple carriers                    1196
SpiceJet                              818
Vistara                               479
Air Asia                              319
GoAir                                 194
Multiple carriers Premium economy      13
Jet Airways Business                    6
Vistara Premium economy                 3
Trujet                                  1
Name: count, dtype: int64
------------------------------
Column: 'Source'
 * Unique Categories: 5
 * Category Distribution:
Source
Delhi       4537
Kolkata     2871
Banglore    2197
Mumbai       697
Chennai      381
Name: count, dtype: int64
------------------------------
Column: 'Destination'
 * Unique Categories: 6
 * Category Distribution:
Destination
Cochin       4537
Banglore     2871
Delhi        1265
New Delhi     932
Hyderabad     697
Kolkata       381
Name: count, dtype: int64
------------------------------
Column: 'Route'
 * Unique Categories: 128
 * Category Distribution:
Route
DEL → BOM → COK                      2376
BLR → DEL                            1552
CCU → BOM → BLR                       979
CCU → BLR                             724
BOM → HYD                             621
                                     ... 
BLR → HBX → BOM → NAG → DEL             1
NaN                                     1
BLR → BOM → IXC → DEL                   1
BLR → CCU → BBI → HYD → VGA → DEL       1
BOM → BBI → HYD                         1
Name: count, Length: 129, dtype: int64
------------------------------
Column: 'Total_Stops'
 * Unique Categories: 5
 * Category Distribution:
Total_Stops
1 stop      5625
non-stop    3491
2 stops     1520
3 stops       45
NaN            1
4 stops        1
Name: count, dtype: int64
------------------------------
Column: 'Additional_Info'
 * Unique Categories: 10
 * Category Distribution:
Additional_Info
No info                         8345
In-flight meal not included     1982
No check-in baggage included     320
1 Long layover                    19
Change airports                    7
Business class                     4
No Info                            3
1 Short layover                    1
Red-eye flight                     1
2 Long layover                     1
Name: count, dtype: int64
------------------------------

Observations from Categorical Columns¶

1. Null Counts:¶

Column NaN Count % of Data Recommended Handling
Airline 0 0% No action needed.
Source 0 0% No action needed.
Destination 0 0% No action needed.
Route 1 ~0.009% Drop the row (too small)
Total_Stops 1 ~0.009% Drop the row (too small)
Additional_Info 0 0% No action needed.
Price 0 0% No action needed.
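The null-count table above can also be produced programmatically. A sketch on a hypothetical three-row frame (the column names mirror the dataset; the rows are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame; only the column names match the real dataset.
sample = pd.DataFrame({
    'Route': ['BLR → DEL', np.nan, 'DEL → BOM → COK'],
    'Total_Stops': ['non-stop', np.nan, '2 stops'],
    'Price': [3897, 7662, 13882],
})

# NaN count and percentage per column, as in the table above
report = pd.DataFrame({
    'NaN Count': sample.isna().sum(),
    '% of Data': (sample.isna().mean() * 100).round(2),
})
print(report)
```

Running the same two lines on the full dataframe reproduces the counts reported in the table.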

2. Naming inconsistencies to fix:¶

Destination

  • Issue: "Delhi" (1265) and "New Delhi" (932) represent the same city.
  • Merge them under a single consistent name, preferably "Delhi"

Additional_Info

  • "No info" (8345) and "No Info" (3) are duplicates with different capitalization.

  • Standardize them into "No info"

In [16]:
# ===== Imputation of categorical features =====
# ===== Drop rare NaN rows in Route and Total_Stops =====
df = df.dropna(subset=['Route', 'Total_Stops'])

# ===== Fix naming issues in Destination =====
df['Destination'] = df['Destination'].replace({'New Delhi': 'Delhi'})

# ===== Fix naming issues in Additional_Info =====
df['Additional_Info'] = df['Additional_Info'].replace({'No Info': 'No info'})

2.3. Check for and remove duplicate values¶

In [17]:
# ===== Check duplicate values =====
# ===== Total number of rows =====
total_rows = len(df)

# ===== Count duplicate rows =====
duplicate_count = df.duplicated().sum()

# ===== Percentage of duplicates =====
duplicate_percentage = (duplicate_count / total_rows) * 100

print(f"Total Rows: {total_rows}")
print(f"Duplicate Rows: {duplicate_count}")
print(f"Percentage of Duplicates: {duplicate_percentage:.2f}%")
Total Rows: 10682
Duplicate Rows: 222
Percentage of Duplicates: 2.08%
In [18]:
# ===== Drop exact duplicates =====
df = df.drop_duplicates()
  • The dataset originally contained 10,683 rows and 13 columns. One row with missing values in Route and Total_Stops (~0.009% of the data) was dropped, as the proportion was negligible. Of the remaining 10,682 rows, 222 (2.08%) were exact duplicates; since they added no new information and could bias the analysis, they were removed. Categorical labels were also standardized, e.g. merging “Delhi” and “New Delhi” into a single category and unifying the capitalization of “No info” vs. “No Info”.

  • After these cleaning steps, the dataset was reduced to 10,460 unique records and 13 columns, ensuring higher data quality and reliability for further analysis.
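The cleaning steps summarized above (drop rare NaN rows, standardize labels, drop exact duplicates) can be sketched end-to-end on a made-up four-row frame:

```python
import pandas as pd

# Made-up rows; only the column names and label quirks mirror the real data.
toy = pd.DataFrame({
    'Destination': ['New Delhi', 'Delhi', 'Cochin', 'New Delhi'],
    'Additional_Info': ['No info', 'No Info', 'No info', 'No info'],
    'Route': ['BLR → DEL', None, 'DEL → BOM → COK', 'BLR → DEL'],
    'Price': [3897, 5000, 13882, 3897],
})

toy = toy.dropna(subset=['Route'])                                      # drop rare NaN rows
toy['Destination'] = toy['Destination'].replace({'New Delhi': 'Delhi'})
toy['Additional_Info'] = toy['Additional_Info'].replace({'No Info': 'No info'})
toy = toy.drop_duplicates()                                             # remove exact duplicates

print(len(toy))                              # 2 rows survive
print(sorted(toy['Destination'].unique()))   # ['Cochin', 'Delhi']
```

Note the order matters: standardizing "New Delhi" to "Delhi" first lets `drop_duplicates` catch rows that differ only by the label spelling.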

3. Task 1 - Exploratory Data Analysis (EDA)¶

3.1. Univariate Analysis: Investigating Individual Features¶

3.1.1. Categorical Features¶

Chart-1. Distribution of Categorical Features¶

In [19]:
# ===== Categorical Features =====
# ===== Select categorical columns and exclude 'Route' =====
categorical_cols = df.select_dtypes(include='object').columns
categorical_cols = [col for col in categorical_cols if col != 'Route']

# ===== Subplot grid =====
n_cols = 2
n_rows = (len(categorical_cols) + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(22, 5*n_rows))
axes = axes.flatten()

# ===== Main title =====
fig.suptitle('Distribution of Categorical Features (Excluding Route)', fontsize=22, fontweight='bold', y=0.98)

# ===== Background color =====
bg_color = '#EDEDED'

# ===== Maroon → Golden Gradient =====
colors_list = ['#FFD700', '#800000']  # Golden → Maroon
custom_cmap = LinearSegmentedColormap.from_list('gold_maroon', colors_list)

# ===== Loop through categorical columns =====
for i, col in enumerate(categorical_cols):
    axes[i].set_facecolor(bg_color)
    axes[i].grid(axis='x', linestyle='--', alpha=0.4, zorder=0)
    axes[i].set_title(f'{col}', fontsize=16, fontweight='bold', color='#222222')
    axes[i].set_xlabel('Count', fontsize=12)

    # ===== All categories sorted ascending for horizontal bars =====
    ctab = df[col].value_counts().sort_values(ascending=True)
    categories = ctab.index
    values = ctab.values

    # ===== Gradient colors proportional to values =====
    norm_values = (values - values.min()) / (values.max() - values.min())
    colors = [custom_cmap(v) for v in norm_values]

    # ===== Horizontal bar plot =====
    bars = axes[i].barh(categories, values, color=colors, edgecolor='#333333', linewidth=0.8, zorder=2)

    # ===== Add counts elegantly =====
    for bar, val in zip(bars, values):
        axes[i].text(val + max(values)*0.01, bar.get_y() + bar.get_height()/2,
                     f"{val}", va='center', fontsize=10, fontweight='bold', color='#222222')

    axes[i].tick_params(axis='y', labelsize=10)

# ===== Remove empty subplots =====
for j in range(i+1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.subplots_adjust(top=0.92)
plt.show()

Categorical Feature Observations¶

1. Airline Distribution

  • Jet Airways dominates the dataset with 3700 flights, followed by IndiGo (2043) and Air India (1694).

  • Airlines like SpiceJet (815), Vistara (477), and Air Asia (318) are present in smaller proportions.

  • Premium categories such as Jet Airways Business (6), Vistara Premium Economy (3), and Trujet (1) are very rare.

  • Insight: The dataset is heavily skewed towards Jet Airways and IndiGo, meaning models may learn price patterns biased towards these airlines. Rare airlines may not significantly impact predictions.

2. Source Distribution

  • Delhi (4345) is the most common source city, followed by Kolkata (2860) and Bangalore (2177).

  • Mumbai (697) and Chennai (381) have fewer entries.

  • Insight: Most flights in the dataset originate from Delhi, indicating it’s a major hub. Chennai contributes the least data, meaning fewer insights for Chennai-origin flights.

3. Destination Distribution

  • Cochin (4345) is the top destination, followed by Bangalore (2860) and Delhi (2177).

  • Hyderabad (697) and Kolkata (381) are much smaller in count.

  • Insight: Cochin is the most frequent destination in this dataset, showing strong traffic towards it.

4. Total Stops Distribution

  • 1 stop (5625) is the most frequent, followed by non-stop (3473).

  • 2 stops (1318) exist but are much less common.

  • 3 stops (43) and 4 stops (1) are extremely rare.

  • Insight: Majority of flights are 1 stop or non-stop, meaning longer layovers are rare. This feature is highly imbalanced.

5. Additional Info Distribution

  • The majority of records have "No info" (8183), followed by "In-flight meal not included" (1926).

  • Other categories like "No check-in baggage included (318)", "1 Long layover (19)", and "Change airports (7)" are very rare.

  • Insight: The “Additional Info” feature is mostly uninformative, since about 97% of entries are just “No info” or “In-flight meal not included”. The rare categories will have minimal impact.

Overall Insights:

  • The dataset is imbalanced across categories, especially in airlines, source, and stops.

  • Jet Airways, Delhi (Source), and Cochin (Destination) dominate the dataset.

  • 1-stop and non-stop flights cover the majority of the records.

  • Additional_Info column has limited variation and may not add much predictive power.
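Because Total_Stops is an ordered category, a common follow-up to the imbalance check above is mapping the labels to integer stop counts. A sketch; the label spellings are taken from the chart above:

```python
import pandas as pd

# Label spellings as they appear in the Total_Stops chart above
STOP_MAP = {'non-stop': 0, '1 stop': 1, '2 stops': 2, '3 stops': 3, '4 stops': 4}

def encode_stops(series: pd.Series) -> pd.Series:
    """Map the ordinal Total_Stops labels to integer stop counts."""
    return series.map(STOP_MAP)
```

This preserves the natural ordering (non-stop < 1 stop < 2 stops ...), which one-hot encoding would discard.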

Chart-2. Distribution of Route Categorical Feature¶

In [20]:
# ===== Categorical Features(Route) =====
column_to_plot = 'Route'
top_n = 10
bg_color = '#EDEDED'
colors_list = ['#FFD700', '#800000']  # ===== Golden → Maroon gradient =====
custom_cmap = LinearSegmentedColormap.from_list('gold_maroon', colors_list)

# ===== Prepare data =====
ctab = df[column_to_plot].value_counts()
top_ctab = ctab.nlargest(top_n)
other_count = ctab.iloc[top_n:].sum()

# ===== Combine top categories and 'Other' =====
top_ctab['Other'] = other_count
categories = top_ctab.index
values = top_ctab.values

# ===== Gradient colors proportional to values =====
norm_values = (values - values.min()) / (values.max() - values.min())
colors = [custom_cmap(v) for v in norm_values]

# ===== Plot =====
plt.figure(figsize=(14, 5))
plt.title(f'Distribution of {column_to_plot} (Top {top_n} + Other)', fontsize=20, fontweight='bold', y=1.03)
plt.gca().set_facecolor(bg_color)
plt.grid(axis='x', linestyle='--', alpha=0.4, zorder=0)

# ===== Horizontal bar plot =====
bars = plt.barh(categories, values, color=colors, edgecolor='#333333', linewidth=0.8, zorder=2)

# ===== Add counts at the end of bars =====
for bar, val in zip(bars, values):
    plt.text(val + max(values)*0.01, bar.get_y() + bar.get_height()/2,
             f"{val}", va='center', fontsize=10, fontweight='bold', color='#222222')

# ===== Axis labels =====
plt.xlabel('Count', fontsize=14)
plt.ylabel(column_to_plot, fontsize=14)
plt.yticks(fontsize=10)

plt.tight_layout()
plt.show()

Insights

  • Delhi and Bangalore dominate as major connecting hubs across multiple routes.

  • Cochin appears often as the final destination in the most common routes.

  • The top 2 routes alone (DEL → BOM → COK & BLR → DEL) contribute significantly to the dataset.

  • The long tail (Other = 2320 flights) indicates a high diversity of routes, but most are individually rare.
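Since most individual routes are rare, the Route string is often decomposed into derived features rather than one-hot encoded directly. A minimal sketch, assuming the '→'-separated format shown in the chart labels:

```python
def route_features(route: str) -> dict:
    """Derive simple features from a '→'-separated route string (format assumed)."""
    hops = [h.strip() for h in route.split('→')]
    return {
        'n_airports': len(hops),
        'n_layovers': max(len(hops) - 2, 0),  # intermediate airports only
        'origin': hops[0],
        'final': hops[-1],
    }
```

For example, the most common route would yield one layover (BOM) between DEL and COK.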

3.1.2. Visualize distributions of the numerical features¶

Chart-3. Visualize the distribution of numerical features¶

In [21]:
# ===== Distribution of numerical features =====
# ===== Set up =====
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("viridis")

# ===== Select numeric columns =====
numerics = df.select_dtypes(include='number')

# ===== Calculate grid dimensions for subplots - 4 columns per row =====
n_cols = 4
n_rows = (len(numerics.columns) + n_cols - 1) // n_cols

# ===== Create figure with subplots =====
fig, axes = plt.subplots(n_rows, n_cols, figsize=(20, 5*n_rows))
fig.suptitle('Distribution Analysis of Numerical Features',
             fontsize=20, fontweight='bold', y=1.05)

# ===== Flatten axes array for easier indexing =====
axes = axes.flatten()

# ===== Create histplot for each numeric column =====
for i, column in enumerate(numerics.columns):
    # ===== Skip if no more columns =====
    if i >= len(axes):
        break

    # ===== Get data for current column =====
    data = numerics[column].dropna()

    # ===== Histogram with maroon bars and KDE overlay =====
    sns.histplot(data, kde=True, ax=axes[i], color='#800000',
                 stat='density', alpha=0.7, bins=30)

    # ===== Recolor the KDE line gold =====
    # ===== Check if KDE line exists =====
    if axes[i].get_lines():
        kde_line = axes[i].get_lines()[0]
        kde_line.set_color('#FFD700')
        kde_line.set_linewidth(2.5)
        kde_line.set_alpha(0.8)

    # ===== Add statistical information =====
    mean_val = data.mean()
    median_val = data.median()
    skewness = data.skew()
    kurtosis = data.kurtosis()

    # ===== Add vertical lines for mean and median =====
    axes[i].axvline(mean_val, color='blue', linestyle='--', linewidth=2,
                   label=f'Mean: {mean_val:.2f}')
    axes[i].axvline(median_val, color='green', linestyle='--', linewidth=2,
                   label=f'Median: {median_val:.2f}')

    # ===== Set title and labels with enhanced formatting =====
    axes[i].set_title(f'{column}\nSkew: {skewness:.2f} | Kurtosis: {kurtosis:.2f}',
                     fontweight='bold', pad=15)
    axes[i].set_xlabel('Value', fontweight='bold')
    axes[i].set_ylabel('Density', fontweight='bold')

    # ===== Add legend with better positioning =====
    axes[i].legend(loc='upper right', frameon=True, fancybox=True, shadow=True)

    # ===== Add a grid for better readability =====
    axes[i].grid(True, alpha=0.3, linestyle='--')

    # ===== Add a box with summary statistics =====
    textstr = f'n = {len(data):,}\nMin = {data.min():.2f}\nMax = {data.max():.2f}\nσ = {data.std():.2f}'
    props = dict(boxstyle='round', facecolor='lightblue', alpha=0.7, edgecolor='navy')
    axes[i].text(0.02, 0.98, textstr, transform=axes[i].transAxes, fontsize=9,
                verticalalignment='top', bbox=props, fontweight='bold')

    # ===== Set background color for subplot =====
    axes[i].set_facecolor('#f8f9fa')

# ===== Hide any empty subplots =====
for j in range(i+1, len(axes)):
    fig.delaxes(axes[j])

# ===== Adjust layout with better spacing =====
plt.tight_layout()
plt.subplots_adjust(top=0.93, hspace=0.4, wspace=0.3)

# ===== Add a border around the entire figure =====
fig.patch.set_edgecolor('black')
fig.patch.set_linewidth(2)

plt.show()

Insights from Numerical Feature Distributions¶

1. Price

  • Range: ₹1,759 – ₹79,512.

  • Mean (9027) > Median (8266) → Right-skewed distribution (skew = 1.86).

  • Heavy positive skew and high kurtosis (13.53) → presence of outliers (very expensive flights).

  • Insight: Most ticket prices are concentrated below ₹20,000, but a few extremely high fares create long tails.

2. Journey_day

  • Range: 1 – 27.

  • Fairly even spread across days, with no strong peaks.

  • Mean (13.46) ≈ Median (12) → almost symmetric.

  • Insight: Flights are fairly distributed across the month, with no strong day-of-month bias.

3. Journey_month

  • Range: March (3) – June (6).

  • Flights are concentrated in May and June.

  • Mean (4.7) ≈ Median (5) → nearly balanced.

  • Insight: Dataset covers only 4 months, with higher flight frequency in May & June (possible seasonal trend).

4. Journey_weekday

  • Range: 0 – 6 (Monday–Sunday, following the pandas weekday convention).

  • Distribution is fairly balanced across weekdays, with some variations.

  • Mean (2.93) ≈ Median (3) → symmetric.

  • Insight: Flights are not biased towards weekdays or weekends → fairly uniform distribution.

5. Dep_minutes (Departure Time in Minutes)

  • Range: 20 – 1435 minutes (~00:20 – 23:55).

  • Distribution shows multiple peaks, suggesting higher flight frequency in morning and evening.

  • Mean (773) ≈ Median (705) → nearly symmetric.

  • Insight: Peak departures are likely during morning and evening rush hours.

6. Arrival_minutes (Arrival Time in Minutes)

  • Range: 190 – 2875 minutes.

  • Mean (1398) > Median (1305) → slightly right-skewed.

  • Multiple peaks indicate popular arrival windows.

  • Insight: Arrival times are clustered around afternoon and late evening.

7. Duration_minutes

  • Range: 5 – 2860 minutes (~48 hours).

  • Mean (629) > Median (505) → right-skewed.

  • Most flights last < 1000 minutes (~16 hours), with few very long flights (possibly multi-stop).

  • Insight: Majority of flights are short-to-medium duration; very long flights are rare outliers.

Overall Insights:

  • Price and Duration are highly skewed → outliers must be treated for better model performance.

  • Journey_day, Journey_month, and Journey_weekday are fairly balanced, so time-based seasonal/weekly effects may be important predictors.

  • Departure & Arrival times show clear time-of-day peaks, which can be critical features in predicting price.
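The minute-based time features analysed above come from string parsing during preprocessing. A sketch of how such conversions might look; the '2h 50m' and 'HH:MM' input formats are assumptions about the raw columns:

```python
import re

def duration_to_minutes(text: str) -> int:
    """Convert strings like '2h 50m', '19h', or '45m' to total minutes."""
    hours = re.search(r'(\d+)h', text)
    mins = re.search(r'(\d+)m', text)
    return (int(hours.group(1)) if hours else 0) * 60 + (int(mins.group(1)) if mins else 0)

def clock_to_minutes(hhmm: str) -> int:
    """Convert an 'HH:MM' clock time to minutes after midnight."""
    h, m = hhmm.split(':')
    return int(h) * 60 + int(m)
```

Under this scheme, 23:55 maps to 1435 minutes, matching the maximum Dep_minutes reported above.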

Numerical Feature Observations¶

| Feature | Skewness / Kurtosis | Observation |
|---|---|---|
| Price | Skew = 1.86 (right-skewed), Kurtosis = 13.53 | Most ticket prices are below ₹20,000, but extreme high fares create long tails (outliers). |
| Journey_day | Skew = 0.12, Kurtosis = -1.27 | Fairly uniform distribution across days of the month; almost symmetric with no strong peaks. |
| Journey_month | Skew = -0.38, Kurtosis = -1.32 | Data spans March–June; flights concentrated in May & June → possible seasonal effect. |
| Journey_weekday | Skew = 0.04, Kurtosis = -1.19 | Balanced distribution across weekdays; no strong weekday vs weekend bias. |
| Dep_minutes (Departure Time) | Skew = 0.12, Kurtosis = -1.19 | Multiple peaks → higher departures during morning & evening rush hours. |
| Arrival_minutes | Skew = 0.46, Kurtosis = -0.41 | Slight right skew; arrivals mostly in afternoon & late evening with multiple peaks. |
| Duration_minutes | Skew = 0.90, Kurtosis = -0.05 | Right-skewed; most flights under 1000 mins (~16 hrs); very long flights are rare outliers. |
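Given the heavy right skew in Price, a log transform is a standard remedy before fitting linear models. A brief sketch on synthetic lognormal data standing in for the real column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed "prices" (lognormal), standing in for the real column
prices = pd.Series(rng.lognormal(mean=9, sigma=0.5, size=5000))

raw_skew = prices.skew()
log_skew = np.log1p(prices).skew()
# log1p compresses the long right tail, pulling skewness toward zero
print(f"raw skew: {raw_skew:.2f}, log1p skew: {log_skew:.2f}")
```

Models are then trained on the log scale and predictions mapped back with `np.expm1`.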

3.1.3. Distribution of categorical features¶

Chart-4. Pie Chart Distribution of Categorical Features¶

In [22]:
# ===== Pie Chart Distribution of Categorical Features =====
plt.figure(figsize=(18, 14))
# ===== Clean style with grid =====
plt.style.use('seaborn-v0_8-whitegrid')

# ===== Define a smooth, elegant blue-green palette with accents =====
colors = ['#800000', '#FFD700', '#4B0082', '#FF4500',  '#2E8B57',  '#4682B4', '#DA70D6']

# ===== Select categorical columns =====
categorical_cols = df.select_dtypes(include='object')

for i, col in enumerate(categorical_cols):
    plt.subplot(4, 4, i + 1)

    # ===== Get value counts and handle potential many categories =====
    value_counts = df[col].value_counts(dropna=False)

    # ===== Group small categories into "Other" =====
    if len(value_counts) > 6:
        threshold = 0.05 * value_counts.sum()
        small_categories = value_counts[value_counts < threshold]
        if len(small_categories) > 0:
            value_counts = value_counts[value_counts >= threshold]
            value_counts['Other'] = small_categories.sum()

    labels = [str(x) for x in value_counts.index]
    sizes = value_counts.values

    # ===== Create dynamic colors =====
    n_categories = len(value_counts)
    chart_colors = [colors[j % len(colors)] for j in range(n_categories)]

    # ===== Plot donut chart with percentage labels =====
    wedges, texts, autotexts = plt.pie(
        sizes,
        # ===== Remove labels from outside =====
        labels=None,
        colors=chart_colors,
        autopct='%1.1f%%',
        startangle=90,
        pctdistance=1.1,
        labeldistance=1.2,
        wedgeprops={'edgecolor': 'white', 'linewidth': 2, 'alpha': 0.95},
        textprops={'fontsize': 9, 'weight': 'bold', 'color': 'black'}
    )

    # ===== Style percentages =====
    for autotext in autotexts:
        autotext.set_weight('bold')
        autotext.set_fontsize(10)

    # ===== Donut effect =====
    centre_circle = plt.Circle((0, 0), 0.60, fc='white')
    plt.gca().add_artist(centre_circle)

    # ===== Add legend instead of labels outside =====
    plt.legend(wedges, labels, title=col.title(), loc="center left", bbox_to_anchor=(1, 0, 0.5, 1))

    # ===== Keep it circular =====
    plt.axis('equal')
    plt.title(col.title(), fontsize=12, weight='bold', pad=15)

plt.suptitle('Distribution of Categorical Variables', fontsize=20, weight='bold', y=0.98)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

1. Why did you pick the specific chart?

  • I used donut/pie charts because they are effective for showing the proportion of categories within each variable.

  • Since variables like Airline, Source, Destination, Route, Total Stops, and Additional Info are categorical, these charts clearly highlight which categories dominate and which are rare.

  • Donut charts make it easier for stakeholders to visually compare category shares at a glance, especially in datasets with imbalances.

2. What insights are found from the chart?

  • Airline: Jet Airways (35.4%) and IndiGo (19.5%) dominate, while smaller airlines contribute little.

  • Source & Destination: Delhi (41.5%) is the top source city, and Cochin (41.5%) is the top destination.

  • Route: The routes DEL → BOM → COK (22.7%) and BLR → DEL (14.7%) are highly frequent, while many other routes are rare (35% grouped as “Other”).

  • Total Stops: Most flights are 1 stop (53.8%) or non-stop (33.2%); very few flights have 2+ stops.

  • Additional Info: Majority of flights provide “No info” (78.2%), with only 18.4% marked as “In-flight meal not included”.

3. Will the gained insights help create a positive business impact?

  • Yes, these insights are business-relevant:

  • Airlines & Routes: Businesses (airlines or travel agencies) can focus marketing and dynamic pricing strategies on popular carriers (Jet Airways, IndiGo) and high-frequency routes.

  • Sources & Destinations: Airports like Delhi (source) and Cochin (destination) can plan better resource allocation (check-in counters, staff, baggage handling) to manage heavy traffic.

  • Stops: Highlighting non-stop and 1-stop flights in promotions can attract customers since they make up 87%+ of flights.

  • Additional Info: Since most flights don’t provide clear “additional info,” there’s an opportunity for airlines to differentiate through transparency (e.g., promoting baggage allowance, meals, business class perks).

3.2. Bivariate Analysis: Examining Relationships Between Variable Pairs¶

3.2.1. Regression plot of feature vs Target Variable¶

Chart-5. Regression plot of feature vs Target Variable¶

In [23]:
# ===== Regression plot of feature vs Target Variable =====
# ===== gray background =====
sns.set_theme(style="darkgrid")

# ===== Select numeric columns =====
numerics = df.select_dtypes(include='number').columns.tolist()

# ===== Define target column (Price) =====
target_col = 'Price'

# Remove target from feature list
if target_col in numerics:
    numerics.remove(target_col)

# ===== Copy numeric features and target column =====
numeric_df_copied = df[numerics + [target_col]].copy()

# ===== Drop missing values =====
numeric_df_copied = numeric_df_copied.dropna()

# ===== Sample data for faster plotting =====
if len(numeric_df_copied) > 5000:
    numeric_df_copied = numeric_df_copied.sample(5000, random_state=42)

# ===== Setup subplots =====
n_cols = 4
n_rows = int(np.ceil(len(numerics) / n_cols))

fig, axes = plt.subplots(n_rows, n_cols, figsize=(22, 18))
axes = axes.flatten()

fig.suptitle('Regression Plots: Numeric Features vs Flight Price',
             fontsize=24, fontweight='bold', y=0.98)

for i, column in enumerate(numerics):
    ax = axes[i]

    # ===== Scatter + Regression line =====
    sns.regplot(
        data=numeric_df_copied,
        x=column,
        y=target_col,
        scatter_kws={'alpha':0.5, 's':30, 'color':'#FFD700', 'edgecolor':'white'},
        line_kws={'color':'#800000', 'linewidth':2},
        ci=95,
        ax=ax
    )

    # ===== Customize =====
    ax.set_title(column.title(), fontsize=14, fontweight='bold')
    ax.set_xlabel(column.title(), fontsize=12, fontweight='bold')
    ax.set_ylabel('Flight Price (₹)', fontsize=12, fontweight='bold')
    ax.grid(True, alpha=0.5)

# ===== Remove empty subplots if extra =====
for j in range(i+1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()

Observations: Regression Plots (Numeric Features vs Flight Price)

| Feature | Observation |
|---|---|
| Journey_Day | No strong trend. Flight price remains scattered across all days. A slight downward slope suggests prices may be marginally lower mid/late month. |
| Journey_Month | No clear relationship. Price distribution is similar across months. |
| Journey_Weekday | No strong impact. Prices are spread throughout the week, with minor variations. |
| Dep_Minutes | Weak relationship. Departure time (in minutes) doesn’t strongly influence price. Prices remain scattered throughout the day. |
| Arrival_Minutes | Slight positive trend. Later arrival times are weakly associated with higher prices. |
| Duration_Minutes | Strongest positive correlation. Longer flight durations clearly lead to higher prices. This feature is highly significant for prediction. |
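These visual trends can be cross-checked with plain Pearson correlations against the target. A sketch on synthetic data; the real notebook would call `df.corr()` directly:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
# Synthetic stand-in: duration drives price, day-of-month does not
duration = rng.uniform(60, 1500, n)
day = rng.integers(1, 28, n)
price = 3000 + 5 * duration + rng.normal(0, 1500, n)

df_demo = pd.DataFrame({'Duration_minutes': duration,
                        'Journey_day': day,
                        'Price': price})
# Correlation of every feature with the target, strongest first
corr = df_demo.corr()['Price'].drop('Price').sort_values(ascending=False)
print(corr)
```

A strong duration–price correlation alongside a near-zero day–price correlation mirrors the regplot observations above.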

1. Why did you pick the specific chart?

  • I used regression plots because they visually represent the relationship between numeric features and the target variable (flight price). This allows us to see both the trend (line fit) and the spread of data (scatter points) in a single chart, making it easier to identify which features have predictive power.

2. What is/are the insight(s) found from the chart?

  • Duration_Minutes has a strong positive correlation with price → longer flights are more expensive.

  • Arrival_Minutes shows a slight upward trend → later arrivals may cost more.

  • Journey_Day, Journey_Month, Journey_Weekday → minimal impact, prices remain widely scattered.

  • Dep_Minutes → weak influence, departure time alone doesn’t drive price significantly.

3. Will the gained insights help create a positive business impact?

  • Yes. These insights directly impact business strategy:

  • Airlines can adjust pricing strategies by accounting for flight duration, which is the strongest cost driver.

  • Marketing and discount campaigns can focus on features with weaker effects (like weekdays or departure times) to attract more customers without significantly affecting revenue.

  • Customers can be better informed about why longer flights are priced higher, improving transparency and trust.

  • Overall, these insights support more accurate price prediction models and smarter revenue management, leading to a positive business impact.

3.2.2. Airline segmentation analysis of the price variable¶

Chart-6. CountPlot for Airline segmentation analysis of the price variable¶

In [24]:
# ===== Visualization code =====
# ===== Bin Price into categories =====
df_air = df.copy()
bins = [0, 5000, 10000, 15000, 20000, df_air['Price'].max()]
labels = ['0-5k', '5k-10k', '10k-15k', '15k-20k', '20k+']
df_air['Price_Range'] = pd.cut(df_air['Price'], bins=bins, labels=labels)

plt.figure(figsize=(20,10))
sns.countplot(
    data=df_air,
    x='Airline',
    hue='Price_Range',
    palette=['#800000', '#FFD700', 'red', 'navy', 'green'],
    edgecolor='black'
)

plt.title("Airline Segmentation by Price Ranges", fontsize=18, weight='bold')
plt.xlabel("Airline", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.xticks(rotation=45, fontsize=10)
plt.yticks(fontsize=10)
plt.legend(title='Price Range (INR)', title_fontsize=12, fontsize=10, loc='upper right')
plt.grid(axis='y', linestyle='--', alpha=0.4)

plt.tight_layout()
plt.show()
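The price-band segmentation behind this chart can also be summarised as a table of shares per airline. A sketch on toy data, reusing the same bins and labels as the plotting code above:

```python
import pandas as pd

toy = pd.DataFrame({
    'Airline': ['IndiGo', 'IndiGo', 'Jet Airways', 'Jet Airways', 'Jet Airways'],
    'Price':   [4500,      9800,     12500,          16000,         58000],
})
bins = [0, 5000, 10000, 15000, 20000, toy['Price'].max()]
labels = ['0-5k', '5k-10k', '10k-15k', '15k-20k', '20k+']
toy['Price_Range'] = pd.cut(toy['Price'], bins=bins, labels=labels)

# Row-normalized share of each price band within an airline
shares = pd.crosstab(toy['Airline'], toy['Price_Range'], normalize='index')
print(shares)
```

Each row sums to 1, so the table reads directly as the within-airline price mix shown by the grouped bars.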

3.2.3. Airline segmentation analysis of the price variable¶

Chart-7. Barplot for Airline segmentation analysis of the price variable¶

In [25]:
# ===== Figure settings =====
plt.figure(figsize=(20, 10))
ax = plt.gca()

# ===== Sort airlines by median price for better visual =====
airline_order = df.groupby('Airline')['Price'].median().sort_values().index

colors = np.linspace(0, 1, len(df['Airline'].unique()))
cmap = LinearSegmentedColormap.from_list("maroon_gold", ["#FFD700", "#800000"])
bar_colors = [cmap(val) for val in colors]

sns.barplot(
    x='Airline',
    y='Price',
    data=df,
    order=airline_order,
    palette=bar_colors,
    edgecolor='black',
    ci=None
)

ax.set_facecolor('#EDEDED')
ax.grid(axis='y', linestyle='-', alpha=0.2)
plt.title('Airline vs Price', fontsize=24, weight='bold', color='#222222')
plt.xlabel('Airline', fontsize=18, weight='bold', color='#333333')
plt.ylabel('Price (INR)', fontsize=18, weight='bold', color='#333333')
plt.xticks(rotation=45, ha='right', fontsize=12, weight='bold', color='#222222')
plt.yticks(fontsize=12, weight='bold', color='#222222')

plt.tight_layout()
plt.show()

1. Why did you pick the specific chart?

  • A bar chart is best suited for comparing categorical variables (Airlines) against a numerical variable (Price).

  • It provides a clear visual comparison of average ticket prices across different airlines and classes.

  • The chart makes it easy to spot outliers and trends, such as which airline or service class is significantly more expensive.

2. What is/are the insight(s) found from the chart?

  • Low-cost carriers (SpiceJet, Trujet, IndiGo, GoAir, Air Asia) have the lowest average ticket prices (₹4,000–₹6,000).

  • Full-service airlines like Vistara and Air India charge moderate fares (₹7,000–₹12,000).

  • Premium Economy fares are slightly higher (₹10,000–₹12,000).

  • Jet Airways Business Class is a major outlier, priced at ~₹58,000, which is 5–10 times higher than economy fares.

  • Pricing differences are influenced more by class of travel than the airline itself.

3. Will the gained insights help create a positive business impact?

  • Yes.

  • These insights help in market segmentation (budget vs premium travelers).

  • Airlines can adjust pricing strategies and highlight value-added services to justify higher prices.

  • Travel agencies and booking platforms can personalize recommendations based on customer budget and preferences, improving customer satisfaction and conversion rates.

  • Helps businesses target promotions effectively (e.g., discounts for economy class to attract price-sensitive travelers, premium packages for business class customers).
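The airline ordering used in the bar chart above comes from median fares, which can be computed directly. A sketch on toy data; the notebook sorts the full `df` the same way:

```python
import pandas as pd

toy = pd.DataFrame({
    'Airline': ['SpiceJet', 'SpiceJet', 'Vistara', 'Jet Airways Business'],
    'Price':   [4200,        4800,       8900,      58000],
})
# Median fare per airline, sorted ascending as in the bar chart
median_fares = toy.groupby('Airline')['Price'].median().sort_values()
print(median_fares)
```

Medians are preferred over means here because the heavy price outliers (e.g. business-class fares) would otherwise dominate.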

3.2.4. Categorical features analysis of the price variable¶

Chart-8. Boxplot for Categorical features analysis of the price variable¶

In [26]:
# ===== Select categorical columns excluding 'Route' and 'Airline' =====
categorical_cols = df.select_dtypes(include='object').columns
categorical_cols = [col for col in categorical_cols if col not in ['Route', 'Airline']]

# ===== Figure settings =====
n_cols = 2
n_rows = (len(categorical_cols) + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(20, 5*n_rows))
axes = axes.flatten()

# ===== Loop through categorical columns =====
for i, col in enumerate(categorical_cols):
    ax = axes[i]
    # ===== Boxplot =====
    sns.boxplot(
        x=col,
        y='Price',
        data=df,
        ax=ax,
        boxprops=dict(facecolor='#FFD700', color='black', linewidth=1.2),
        whiskerprops=dict(color='black', linewidth=1),
        capprops=dict(color='black', linewidth=1),
        medianprops=dict(color='#800000', linewidth=2),
        flierprops=dict(marker='o', markerfacecolor='#800000', markersize=5, alpha=0.8, markeredgecolor='black')
    )

    ax.set_title(f'{col} vs Price', fontsize=16, weight='bold', color='#222222')
    ax.set_xlabel(col, fontsize=12, weight='bold', color='#333333')
    ax.set_ylabel('Price (INR)', fontsize=12, weight='bold', color='#333333')
    ax.tick_params(axis='x', rotation=45, labelsize=10)
    ax.grid(axis='y', linestyle='--', alpha=0.3)

for j in range(i+1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.suptitle('Categorical Features vs Price', fontsize=20, weight='bold', y=1.05)
plt.show()

1. Source vs Price

  • Flights originating from Delhi and Kolkata generally have higher median prices compared to other cities.

  • Chennai and Mumbai show lower price ranges on average.

  • Bangalore has a wide range of prices, including extreme outliers (up to 80,000 INR).

  • Price variation is highly dependent on source location, suggesting departure city impacts ticket pricing.

2. Destination vs Price

  • Flights with destination Cochin and Bangalore generally show higher price ranges.

  • Delhi and Hyderabad destinations have moderate prices.

  • Kolkata as a destination is relatively cheaper compared to others.

  • Like the source, destination significantly influences ticket price variations.

3. Total Stops vs Price

  • Non-stop flights are the cheapest overall.

  • 1 stop and 2 stops have noticeably higher median prices.

  • Flights with 3 stops and especially 4 stops are very expensive (close to 20k–30k INR consistently).

  • Prices increase with the number of stops, though not always linearly (e.g., 1-stop flights are often more expensive than 2-stop flights due to demand/supply factors).

4. Additional Info vs Price

  • Business class flights have the highest ticket prices, with a wide range reaching 80,000 INR.

  • Passengers with “No info” or basic inclusions/exclusions (like no check-in baggage or meals not included) tend to pay less.

  • Flights with layovers or changes in airports are priced higher than simple direct ones.

  • Red-eye flights are among the cheapest options.

  • This shows service and travel conditions strongly influence prices.

Overall Insights:

  • Ticket prices are strongly influenced by categorical factors like source, destination, number of stops, and additional services.

  • Business class and multi-stop flights are significantly more expensive.

  • Non-stop, red-eye, and flights from certain cities (like Chennai & Kolkata) tend to be cheaper.
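The boxplot medians discussed above can be checked numerically with a groupby, using an ordered categorical so stop counts appear in logical order. A sketch on toy data; the notebook would group `df` itself:

```python
import pandas as pd

order = ['non-stop', '1 stop', '2 stops', '3 stops', '4 stops']
toy = pd.DataFrame({
    'Total_Stops': ['2 stops', 'non-stop', '1 stop', 'non-stop'],
    'Price':       [13000,      3500,       9000,     4200],
})
# An ordered categorical keeps stop counts in logical order in groupbys/plots
toy['Total_Stops'] = pd.Categorical(toy['Total_Stops'], categories=order, ordered=True)
medians = toy.groupby('Total_Stops', observed=False)['Price'].median()
print(medians)
```

The same categorical ordering can be passed to the boxplot so the x-axis runs non-stop → 4 stops instead of alphabetically.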

3.3. Multivariate Analysis: Examining Multiple Variables Simultaneously¶

3.3.1. Correlation Heatmap: Highlights correlations between numerical features¶

Chart-9. Correlation Heatmap¶

In [27]:
# ===== Correlation Heatmap visualization code =====
numeric_df = df.select_dtypes(include=['number'])

custom_cmap = sns.color_palette("blend:#FFD700,white,#800000", as_cmap=True)

plt.figure(figsize=(15,6))
sns.heatmap(
    numeric_df.corr(),
    annot=True,
    fmt=".2f",
    cmap=custom_cmap,
    center=0,
    linewidths=1.5,
    linecolor="lightgrey",
    annot_kws={"size":12, "weight":"bold", "color":"black"},
    cbar_kws={"shrink":0.7, "aspect":30, "label":"Correlation Strength"}
)

plt.title("Correlation Heatmap of Numeric Features",
          fontsize=16, fontweight="bold", color="black", pad=20)
plt.xticks(rotation=45, ha="right", fontsize=11, weight="bold", color="#222")
plt.yticks(rotation=0, fontsize=11, weight="bold", color="#222")
plt.grid(False)
plt.tight_layout()
plt.show()

Strong Positive Correlations:

| Feature 1 | Feature 2 | Correlation (r) |
|---|---|---|
| Arrival_minutes | Duration_minutes | 0.81 |

Moderate Positive Correlations:

| Feature 1 | Feature 2 | Correlation (r) |
|---|---|---|
| Price | Arrival_minutes | 0.41 |
| Price | Duration_minutes | 0.50 |
| Dep_minutes | Arrival_minutes | 0.56 |

Weak Positive Correlations:

| Feature 1 | Feature 2 | Correlation (r) |
|---|---|---|
| Price | Journey_weekday | 0.06 |
| Journey_month | Dep_minutes | 0.04 |
| Journey_month | Arrival_minutes | 0.03 |

Weak Negative Correlations:

| Feature 1 | Feature 2 | Correlation (r) |
|---|---|---|
| Price | Journey_day | -0.16 |
| Price | Journey_month | -0.11 |
| Journey_day | Journey_month | -0.04 |
| Journey_day | Journey_weekday | -0.09 |
| Journey_day | Dep_minutes | -0.00 |
| Journey_day | Arrival_minutes | -0.03 |
| Journey_day | Duration_minutes | -0.03 |
| Journey_month | Journey_weekday | -0.08 |
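Tables like these can be generated directly from the correlation matrix instead of read off the heatmap. A sketch with a toy frame:

```python
import numpy as np
import pandas as pd

def top_corr_pairs(corr: pd.DataFrame) -> pd.Series:
    """Flatten the upper triangle of a correlation matrix into pairs ranked by |r|."""
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # exclude diagonal
    pairs = corr.where(mask).stack().dropna()
    return pairs.reindex(pairs.abs().sort_values(ascending=False).index)

# Toy example: 'a' and 'b' are perfectly correlated, 'c' is negatively related
demo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 3, 3, 1]})
ranked = top_corr_pairs(demo.corr())
print(ranked)
```

Applied to `numeric_df.corr()`, the top of this ranking reproduces the strong and moderate pairs listed above.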

3.3.2. Stacked Bar Chart – Airline vs Source vs Price¶

Chart-10. Stacked Bar Chart – Airline vs Source vs Price¶

In [28]:
# ===== Stacked Bar Chart – Airline vs Source vs Price   =====
# === Prepare grouped data ===
route_airline = df.groupby(["Airline", "Source"])["Price"].count().unstack().fillna(0)

# ===== Convert counts to percentages =====
route_airline_pct = route_airline.div(route_airline.sum(axis=1), axis=0) * 100

# === Define custom color palette for Sources ===
custom_colors = {
    "Banglore": "gold",
    "Chennai": "navy",
    "Delhi": "maroon",
    "Kolkata": "purple",
    "Mumbai": "green"
}

colors = [custom_colors.get(col, "gray") for col in route_airline_pct.columns]
fig, ax = plt.subplots(figsize=(20,9))
route_airline_pct.plot(
    kind="barh",
    stacked=True,
    color=colors,
    edgecolor="black",
    linewidth=0.7,
    ax=ax
)

# Title & Labels
plt.title("Airline vs Source – Flight Distribution", fontsize=22, weight="bold", color="#222831", pad=20)
plt.xlabel("Percentage of Flights (%)", fontsize=16, weight="bold", color="#393E46", labelpad=15)
plt.ylabel("Airline", fontsize=16, weight="bold", color="#393E46", labelpad=15)

# Y-axis ticks
plt.yticks(fontsize=12, weight="bold", color="#222831")

# X-axis ticks
plt.xticks(fontsize=12, weight="bold", color="#222831")

# Grid only on x-axis
plt.grid(axis="x", linestyle="--", alpha=0.4)

# Legend styling (bottom left, outside plot area)
legend = plt.legend(
    title="Source",
    fontsize=12,
    title_fontsize=13,
    loc="lower left",
    bbox_to_anchor=(-0.19, -0.03),
    frameon=True,
    shadow=True,
    fancybox=True,
    borderpad=1
)
plt.setp(legend.get_title(), weight="bold")

# === Annotate percentages inside bars (skip 0%) ===
for container in ax.containers:
    labels = [f"{w:.1f}%" if w > 0 else "" for w in container.datavalues]
    ax.bar_label(container, labels=labels, label_type="center", fontsize=10, weight="bold", color="black")

plt.tight_layout()
plt.show()

Insights:¶

  • Bangalore is the leading source city, dominating airlines like Vistara Premium Economy (66.7%) and Jet Airways Business (66.7%).

  • Delhi is the second strongest hub, contributing 100% for Multiple carriers and over 40% for Air India.

  • TruJet flights originate only from Kolkata (100%), showing city exclusivity.

  • Multiple carriers (regular and premium economy) depend entirely on Delhi (100%).

  • SpiceJet shows a balanced spread, with Hyderabad (36.8%), Bangalore (21.8%), Chennai (15.7%), and Mumbai (15%).

  • Vistara is also well distributed, with nearly equal contributions from Bangalore (38.4%) and Hyderabad (38.4%).

  • IndiGo has the widest mix: Delhi (34.5%), Bangalore (25.1%), and notable shares from other cities.

  • GoAir is split between Bangalore (47.9%) and Delhi (39.2%), with a smaller share from Mumbai (12.9%).

  • Air India is Delhi-heavy (41.5%), but also significant in Hyderabad (29.6%) and Bangalore (19.4%).

  • Mumbai and Chennai play supporting roles, contributing moderate shares in multi-city airlines but rarely dominate.

3.3.3. Pairplot¶

Chart-11. Pairplot¶

In [29]:
# ===== Pair Plot visualization code =====
numeric_df = df.select_dtypes(include=['number'])

sns.pairplot(numeric_df, plot_kws={"color": "maroon"}, diag_kws={"color": "#FFD700"})
plt.show()

1. Why did you pick the specific chart?

  • A pair plot displays every pairwise scatter plot and per-feature distribution among the numeric variables at once, making it easy to spot which features are related to Price and whether any feature pairs form well-separated clusters.

3.4. Hypothesis Testing¶

Based on the chart experiments, define three hypothetical statements about the dataset. In the next three answers, perform hypothesis testing to obtain a final conclusion about the statements through your code and statistical testing.¶

3.4.1. Hypothetical Statement - 1¶

1. State Your research hypothesis as a null hypothesis and alternate hypothesis.¶

Hypotheses:

  • Null Hypothesis (H0): Average flight price is the same across all airlines.

  • Alternative Hypothesis (H1): Average flight price significantly differs among airlines.

2. Perform an appropriate statistical test¶

In [30]:
# ===== Group prices by Airline =====
groups = [df[df['Airline'] == airline]['Price'] for airline in df['Airline'].unique()]

# ===== Perform ANOVA =====
f_stat, p_val = f_oneway(*groups)

print("Airline vs Price - ANOVA Test\n")
print("F-statistic:", f_stat)
print("P-value:", p_val)

# ===== Interpretation =====
if p_val < 0.05:
    print("\nResult: Reject H0 → Airline has a significant impact on flight price.")
else:
    print("\nResult: Fail to Reject H0 → No significant difference across Airlines.")
Airline vs Price - ANOVA Test

F-statistic: 654.1998364047217
P-value: 0.0

Result: Reject H0 → Airline has a significant impact on flight price.

Why One-way ANOVA test?

  • Because Airline is categorical and Price is continuous, ANOVA tests whether the mean prices differ significantly across multiple airlines.
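ANOVA only reports that some group means differ, not which ones. A post-hoc Tukey HSD test (sketched here on synthetic fares, not the project's df) pinpoints the differing pairs:

```python
import numpy as np
from scipy.stats import tukey_hsd

# Synthetic fare samples for three hypothetical carriers
rng = np.random.default_rng(0)
budget = rng.normal(5000, 800, 200)    # low-cost carrier
full_a = rng.normal(9000, 1200, 200)   # full-service carrier
full_b = rng.normal(9100, 1200, 200)   # a similar full-service carrier

# Pairwise mean comparisons with family-wise error control
res = tukey_hsd(budget, full_a, full_b)
print(res.pvalue.round(4))  # matrix of pairwise p-values
```

Applied to the real data, each group would be one airline's Price series, exactly as built for f_oneway above.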

3. Business Insight:¶

  • Different airlines charge significantly different ticket prices → pricing is not uniform across airlines.

  • Airline choice strongly influences customer cost, so passengers may compare airlines for affordability, while airlines can use this insight for competitive pricing.

3.4.2. Hypothetical Statement - 2¶

1. State Your research hypothesis as a null hypothesis and alternate hypothesis.¶

Hypotheses:

  • Null Hypothesis (H0): The number of stops has no impact on ticket prices.

  • Alternative Hypothesis (H1): The number of stops significantly affects ticket prices.

2. Perform an appropriate statistical test¶

In [31]:
# ===== Group prices by Total Stops =====
groups = [df[df['Total_Stops'] == stop]['Price'] for stop in df['Total_Stops'].unique()]

# ===== Perform ANOVA =====
f_stat, p_val = f_oneway(*groups)

print("Total_Stops vs Price - ANOVA Test\n")
print("F-statistic:", f_stat)
print("P-value:", p_val)

# ===== Interpretation =====
if p_val < 0.05:
    print("\nResult: Reject H0 → Number of stops significantly affects flight price.")
else:
    print("\nResult: Fail to Reject H0 → Stops do not significantly affect flight price.")
Total_Stops vs Price - ANOVA Test

F-statistic: 1722.028490860059
P-value: 0.0

Result: Reject H0 → Number of stops significantly affects flight price.

Why One-way ANOVA test?

  • Total_Stops is a categorical variable with more than two groups (non-stop, 1 stop, 2 stops, etc.).

  • ANOVA checks if the average ticket prices are significantly different across these multiple categories.

3. Business Insight:¶

  • Flights with more stops tend to be priced higher in this dataset, likely because multi-stop itineraries cover longer or less competitive routes.

  • Airlines can use this to optimize pricing strategies by offering competitive fares on high-demand, time-saving routes.

3.4.3. Hypothetical Statement - 3¶

1. State Your research hypothesis as a null hypothesis and alternate hypothesis.¶

Hypotheses:

  • Null Hypothesis (H0): Flight duration has no correlation with ticket price.

  • Alternative Hypothesis (H1): Flight duration is significantly correlated with ticket price.

2. Perform an appropriate statistical test¶

In [32]:
# ===== Perform Pearson Correlation =====
corr, p_val = pearsonr(df['Duration_minutes'], df['Price'])

print("\nDuration vs Price - Correlation Test\n")
print("Correlation Coefficient:", corr)
print("P-value:", p_val)

# ===== Interpretation =====
if p_val < 0.05:
    print("\nResult: Reject H0 → Flight duration is significantly correlated with Price.")
else:
    print("\nResult: Fail to Reject H0 → No significant correlation between Duration and Price.")
Duration vs Price - Correlation Test

Correlation Coefficient: 0.5017099519431807
P-value: 0.0

Result: Reject H0 → Flight duration is significantly correlated with Price.

Why Correlation test?

  • Both Duration_minutes and Price are numeric (continuous) variables.

  • Correlation test (Pearson’s r) checks whether there is a linear relationship between flight duration and ticket price.

3. Business Insight:¶

  • Longer flight durations are strongly associated with higher ticket prices. This indicates that as the travel time increases (especially for long-haul routes), the price tends to rise, which is crucial for both airlines in pricing strategy and passengers in planning cost-effective journeys.
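Pearson's r only measures linear association; if price grew nonlinearly with duration, a rank-based Spearman correlation would capture the monotonic trend better. A quick sketch on synthetic data (not the project's df):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic convex duration-price relationship
rng = np.random.default_rng(42)
duration = rng.uniform(60, 1500, 500)                          # minutes
price = 2000 + 0.004 * duration**2 + rng.normal(0, 500, 500)   # fares

r_pearson, _ = pearsonr(duration, price)
r_spearman, _ = spearmanr(duration, price)
print(f"Pearson r = {r_pearson:.3f}, Spearman rho = {r_spearman:.3f}")
```

Checking both on Duration_minutes vs Price would confirm whether the r ≈ 0.50 reported above understates a nonlinear relationship.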

4. Data Pre-Processing¶

4.1. Handling Missing Values / Null Values¶

In [33]:
# ===== Finding missing values =====
df.isnull().sum().to_frame("Missing_Values")
Out[33]:
Missing_Values
Airline 0
Source 0
Destination 0
Route 0
Total_Stops 0
Additional_Info 0
Price 0
Journey_day 0
Journey_month 0
Journey_weekday 0
Dep_minutes 0
Arrival_minutes 0
Duration_minutes 0

The dataset has been checked for missing values, and no null or missing entries were found, indicating that the data is complete and clean for analysis.

4.2. Handling Outliers: Detection and Treatment Strategies¶

4.2.1. Perform outlier detection:¶

Chart-12. Plotting box plots for all numerical variables¶

In [34]:
# ===== Plotting box plots for all numerical variables =====
numeric_df = df.select_dtypes(include=['number'])

# ===== Background =====
plt.style.use('ggplot')

plt.figure(figsize=(20, 15))
num_plots = min(len(numeric_df.columns), 13)

for i, col in enumerate(numeric_df.columns[:num_plots]):
    plt.subplot(4, 4, i + 1)

    sns.boxplot(
        data=df,
        x=col,
        boxprops=dict(color='#FFD700', facecolor='#800000', linewidth=3),
        flierprops=dict(marker='o', markerfacecolor='#800000', markersize=5, linestyle='none'),
        medianprops=dict(color='#800000', linewidth=2),
        whiskerprops=dict(color='#FFD700', linewidth=3),
        capprops=dict(color='#FFD700', linewidth=3)
    )

    plt.title(col, fontsize=12, fontweight='bold')
    plt.xlabel('')

plt.suptitle("Outlier Visualization in Numerical Columns", fontsize=20, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

4.2.2. Calculate the number of outliers and their percentage:¶

In [35]:
# ===== Defining the function for outlier detection and percentage calculation using IQR =====

def detect_outliers(data):
    data = np.array(data)

    # ===== Quartiles =====
    q1 = np.percentile(data, 25)
    q2 = np.percentile(data, 50)
    q3 = np.percentile(data, 75)

    # ===== IQR & bounds =====
    IQR = q3 - q1
    lower_bound = q1 - 1.5 * IQR
    upper_bound = q3 + 1.5 * IQR

    # ===== Outlier detection =====
    outliers = data[(data < lower_bound) | (data > upper_bound)]
    outlier_count = len(outliers)
    outlier_percent = round(outlier_count * 100 / len(data), 2)

    # ===== Display results =====
    print(f"Q1 = {q1}, Q2 (Median) = {q2:.2f}, Q3 = {q3}")
    print(f"IQR = {IQR:.2f}")
    print(f"Lower Bound = {lower_bound:.2f}, Upper Bound = {upper_bound:.2f}")
    print(f"Outliers Detected: {outlier_count}")
    print(f"Outlier Percentage: {outlier_percent}%\n")
In [36]:
# ===== Calculating IQR, Lower/Upper Bounds, and Outlier Counts for Continuous Numerical Features =====

for feature in numeric_df:
  print(feature,":")
  detect_outliers(df[feature])
  print("*"*50)
Price :
Q1 = 5224.0, Q2 (Median) = 8266.00, Q3 = 12346.25
IQR = 7122.25
Lower Bound = -5459.38, Upper Bound = 23029.62
Outliers Detected: 94
Outlier Percentage: 0.9%

**************************************************
Journey_day :
Q1 = 6.0, Q2 (Median) = 12.00, Q3 = 21.0
IQR = 15.00
Lower Bound = -16.50, Upper Bound = 43.50
Outliers Detected: 0
Outlier Percentage: 0.0%

**************************************************
Journey_month :
Q1 = 3.0, Q2 (Median) = 5.00, Q3 = 6.0
IQR = 3.00
Lower Bound = -1.50, Upper Bound = 10.50
Outliers Detected: 0
Outlier Percentage: 0.0%

**************************************************
Journey_weekday :
Q1 = 1.0, Q2 (Median) = 3.00, Q3 = 5.0
IQR = 4.00
Lower Bound = -5.00, Upper Bound = 11.00
Outliers Detected: 0
Outlier Percentage: 0.0%

**************************************************
Dep_minutes :
Q1 = 480.0, Q2 (Median) = 705.00, Q3 = 1080.0
IQR = 600.00
Lower Bound = -420.00, Upper Bound = 1980.00
Outliers Detected: 0
Outlier Percentage: 0.0%

**************************************************
Arrival_minutes :
Q1 = 980.0, Q2 (Median) = 1305.00, Q3 = 1720.0
IQR = 740.00
Lower Bound = -130.00, Upper Bound = 2830.00
Outliers Detected: 56
Outlier Percentage: 0.54%

**************************************************
Duration_minutes :
Q1 = 170.0, Q2 (Median) = 505.00, Q3 = 910.0
IQR = 740.00
Lower Bound = -940.00, Upper Bound = 2020.00
Outliers Detected: 75
Outlier Percentage: 0.72%

**************************************************
| Feature Name | Outlier % | Action | Reason |
|---|---|---|---|
| Price | 0.9% | Rectify | Small % of high-ticket prices; may represent premium/business class. |
| Journey_day | 0.0% | Keep | No outliers detected; values lie within 1–31. |
| Journey_month | 0.0% | Keep | No outliers detected; values are within the valid month range (1–12). |
| Journey_weekday | 0.0% | Keep | No outliers detected; weekdays range between 0–6. |
| Dep_minutes | 0.0% | Keep | No outliers detected; departure times are valid (0–1440 min). |
| Arrival_minutes | 0.54% | Keep | Very few late arrivals; possibly due to long-haul flights. |
| Duration_minutes | 0.72% | Rectify | Some extreme durations; may reflect connecting or international flights. |

4.2.3. Outlier removal operation:¶

In [37]:
# ===== Defining the function for outlier removal code =====

def remove_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    filtered_df = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
    print(f"Removed {df.shape[0] - filtered_df.shape[0]} outliers from '{column}'")
    return filtered_df
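The action table above marks Price and Duration_minutes as "Rectify"; instead of dropping rows, the same IQR bounds could be used to cap (winsorize) extreme values, which preserves the sample size. A minimal sketch with hypothetical fares:

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series) -> pd.Series:
    """Clip values to the Tukey fences instead of removing rows."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

fares = pd.Series([4000, 5200, 8300, 12300, 15000, 80000])  # one extreme fare
capped = cap_outliers_iqr(fares)
print(capped.max())  # 26850.0 — the extreme fare is pulled down to the upper fence
```

Capping is often preferred when, as noted above, the outliers are plausible premium-class fares rather than data errors.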
In [38]:
# ===== Run code =====
# ===== copy for comparison purposes =====
df_clean = df.copy()

df_clean = remove_outliers_iqr(df_clean, 'Price')
df_clean = remove_outliers_iqr(df_clean, 'Duration_minutes')
Removed 94 outliers from 'Price'
Removed 74 outliers from 'Duration_minutes'

4.2.4. After the outliers were removed:¶

Chart-13. Boxplot Comparison (Before and After)¶

In [39]:
# ===== Boxplot comparison code =====

box_style = dict(
    boxprops=dict(color='#FFD700', facecolor='#FFD700', linewidth=3),
    flierprops=dict(marker='o', markerfacecolor='#800000', markersize=5, linestyle='none'),
    medianprops=dict(color='#800000', linewidth=2),
    whiskerprops=dict(color='#FFD700', linewidth=3),
    capprops=dict(color='#FFD700', linewidth=3)
)

columns_to_plot = ['Price', 'Duration_minutes']
titles = ['Price', 'Duration_minutes']

fig, axes = plt.subplots(2, 1, figsize=(15, 7))

for i, col in enumerate(columns_to_plot):
    combined_data = pd.concat([df[col], df_clean[col]])
    group_labels = ['Before'] * len(df[col]) + ['After'] * len(df_clean[col])

    sns.boxplot(
        y=group_labels,
        x=combined_data,
        ax=axes[i],
        color='white',
        **box_style
    )

    axes[i].set_title(f'{titles[i]} (Before vs After)', fontsize=16, fontweight='bold')
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')
    axes[i].grid(True, axis='x', linestyle='--', alpha=0.5)

    axes[i].set_yticklabels(['Before', 'After'], fontsize=14, weight='bold')
    for tick in axes[i].get_yticklabels():
        if tick.get_text() == 'Before':
            tick.set_color('crimson')
        elif tick.get_text() == 'After':
            tick.set_color('darkgreen')

plt.suptitle('Boxplot Comparison', fontsize=20, fontweight='bold', color='navy')
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
In [40]:
# ===== After comparing box plots, I made the following changes =====

df = df_clean.copy()

5. Feature Engineering¶

5.1. Feature Selection¶

5.1.1. Encoding Categorical Variables¶

In [41]:
# ===== Categorical Features =====
# ===== Run code =====
categorical_cols = df.select_dtypes(include='object')
for col in categorical_cols:
    print(f"Column: '{col}'")
    print(f" * Unique Categories: {df[col].nunique()}")
    print(f" * Category Distribution:\n{df[col].value_counts(dropna=False)}")
    print("-" * 35)
Column: 'Airline'
 * Unique Categories: 11
 * Category Distribution:
Airline
Jet Airways                          3613
IndiGo                               2043
Air India                            1630
Multiple carriers                    1186
SpiceJet                              814
Vistara                               477
Air Asia                              318
GoAir                                 194
Multiple carriers Premium economy      13
Vistara Premium economy                 3
Trujet                                  1
Name: count, dtype: int64
-----------------------------------
Column: 'Source'
 * Unique Categories: 5
 * Category Distribution:
Source
Delhi       4271
Kolkata     2853
Banglore    2097
Mumbai       690
Chennai      381
Name: count, dtype: int64
-----------------------------------
Column: 'Destination'
 * Unique Categories: 5
 * Category Distribution:
Destination
Cochin       4271
Banglore     2853
Delhi        2097
Hyderabad     690
Kolkata       381
Name: count, dtype: int64
-----------------------------------
Column: 'Route'
 * Unique Categories: 125
 * Category Distribution:
Route
DEL → BOM → COK                      2368
BLR → DEL                            1532
CCU → BOM → BLR                       979
CCU → BLR                             723
BOM → HYD                             621
                                     ... 
BOM → JAI → DEL → HYD                   1
BLR → HBX → BOM → NAG → DEL             1
BLR → BOM → IXC → DEL                   1
BLR → CCU → BBI → HYD → VGA → DEL       1
BOM → BBI → HYD                         1
Name: count, Length: 125, dtype: int64
-----------------------------------
Column: 'Total_Stops'
 * Unique Categories: 5
 * Category Distribution:
Total_Stops
1 stop      5550
non-stop    3470
2 stops     1242
3 stops       29
4 stops        1
Name: count, dtype: int64
-----------------------------------
Column: 'Additional_Info'
 * Unique Categories: 6
 * Category Distribution:
Additional_Info
No info                         8040
In-flight meal not included     1918
No check-in baggage included     318
1 Long layover                     9
Change airports                    6
Red-eye flight                     1
Name: count, dtype: int64
-----------------------------------
| Feature Name | Type | Example Values | Recommended Encoding | Reason |
|---|---|---|---|---|
| Airline | Categorical (multi-class) | Jet Airways, IndiGo, Air India, SpiceJet, Vistara | One-Hot Encoding | Nominal variable with no order; the model should not assume a ranking. |
| Source | Categorical (multi-class) | Delhi, Kolkata, Banglore, Mumbai, Chennai | One-Hot Encoding | Nominal locations; no ordinal relationship. |
| Destination | Categorical (multi-class) | Cochin, Banglore, Delhi, Hyderabad, Kolkata | One-Hot Encoding | Nominal locations; no ordinal relationship. |
| Route | High-cardinality categorical | DEL → BOM → COK, BLR → DEL, CCU → BLR (125 unique) | Target / Frequency Encoding | Too many categories; one-hot would explode dimensionality. |
| Total_Stops | Ordinal categorical | non-stop, 1 stop, 2 stops, 3 stops, 4 stops | Ordinal Encoding | Clear increasing order; can be mapped to 0–4. |
| Additional_Info | Categorical (multi-class) | No info, In-flight meal not included, Red-eye flight | One-Hot Encoding | Few categories, no order; one-hot is simple and effective. |
In [42]:
# ===== Encode the categorical features =====
# ===== Define Feature Groups =====
one_hot_features = ["Airline", "Source", "Destination", "Additional_Info"]
ordinal_features = ["Total_Stops"]
frequency_features = ['Route']

# Ordinal mapping for Total_Stops
ordinal_mapping = [['non-stop', '1 stop', '2 stops', '3 stops', '4 stops']]

# ===== Build Encoding Pipeline =====
preprocessor = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(drop="first"), one_hot_features),
        ("ordinal", OrdinalEncoder(categories=ordinal_mapping), ordinal_features),
        ("freq", CountEncoder(), frequency_features)
    ],
    # ===== keep numeric + binary features =====
    remainder="passthrough"
)

# ===== Fit & Transform =====
df_fit = preprocessor.fit_transform(df)

# ===== Get Feature Names =====
onehot_feature_names = preprocessor.named_transformers_["onehot"].get_feature_names_out(one_hot_features)
ordinal_feature_names = ordinal_features
frequency_features_names = frequency_features
passthrough_features = [col for col in df.columns if col not in one_hot_features + ordinal_features + frequency_features]

# Final feature names
final_feature_names = list(onehot_feature_names) + ordinal_feature_names + frequency_features_names + passthrough_features

# ===== Convert to DataFrame =====
df_encoded = pd.DataFrame(df_fit, columns=final_feature_names, index=df.index)

# ===== Convert all boolean columns to integers =====
bool_cols = df_encoded.select_dtypes(include='bool').columns
df_encoded[bool_cols] = df_encoded[bool_cols].astype(int)

# ===== Final Output =====
print("Shape of encoded dataset:", df_encoded.shape)
print(df_encoded.head())
Shape of encoded dataset: (10292, 32)
   Airline_Air India  Airline_GoAir  Airline_IndiGo  Airline_Jet Airways  \
0                0.0            0.0             1.0                  0.0   
1                1.0            0.0             0.0                  0.0   
2                0.0            0.0             0.0                  1.0   
3                0.0            0.0             1.0                  0.0   
4                0.0            0.0             1.0                  0.0   

   Airline_Multiple carriers  Airline_Multiple carriers Premium economy  \
0                        0.0                                        0.0   
1                        0.0                                        0.0   
2                        0.0                                        0.0   
3                        0.0                                        0.0   
4                        0.0                                        0.0   

   Airline_SpiceJet  Airline_Trujet  Airline_Vistara  \
0               0.0             0.0              0.0   
1               0.0             0.0              0.0   
2               0.0             0.0              0.0   
3               0.0             0.0              0.0   
4               0.0             0.0              0.0   

   Airline_Vistara Premium economy  ...  Additional_Info_Red-eye flight  \
0                              0.0  ...                             0.0   
1                              0.0  ...                             0.0   
2                              0.0  ...                             0.0   
3                              0.0  ...                             0.0   
4                              0.0  ...                             0.0   

   Total_Stops   Route    Price  Journey_day  Journey_month  Journey_weekday  \
0          0.0  1532.0   3897.0         24.0            3.0              6.0   
1          2.0     6.0   7662.0          1.0            5.0              2.0   
2          2.0    41.0  13882.0          9.0            6.0              6.0   
3          1.0     9.0   6218.0         12.0            5.0              6.0   
4          1.0     3.0  13302.0          1.0            3.0              4.0   

   Dep_minutes  Arrival_minutes  Duration_minutes  
0       1340.0           1510.0             170.0  
1        350.0            795.0             445.0  
2        565.0           1705.0            1140.0  
3       1085.0           1410.0             325.0  
4       1010.0           1295.0             285.0  

[5 rows x 32 columns]

This code builds a clean encoding pipeline for the Flight Price Prediction dataset. Categorical variables such as Airline, Source, Destination, and Additional_Info are transformed with One-Hot Encoding, while Total_Stops is handled with Ordinal Encoding based on the natural stop hierarchy (non-stop < 1 stop < 2 stops < …). The Route feature is encoded with Frequency Encoding to capture the importance of popular versus rare routes, and the remaining numeric features pass through unchanged via remainder="passthrough". After transformation, the code reconstructs a DataFrame with proper feature names, so the dataset is fully prepared for downstream machine learning models.
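The CountEncoder step for Route can be reproduced in plain pandas; each route string is replaced by how often it occurs (the routes below are a small hypothetical sample):

```python
import pandas as pd

routes = pd.Series([
    "DEL → BOM → COK", "BLR → DEL", "DEL → BOM → COK",
    "CCU → BLR", "DEL → BOM → COK", "BLR → DEL",
])

counts = routes.value_counts()   # frequency of each distinct route
encoded = routes.map(counts)     # replace each route by its count
print(encoded.tolist())          # [3, 2, 3, 1, 3, 2]
```

This is why frequent routes such as DEL → BOM → COK show up as large values (e.g., 2368) in the encoded Route column.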

In [43]:
# ===== Checking =====
df_encoded.tail(10).T
Out[43]:
10671 10674 10675 10676 10677 10678 10679 10680 10681 10682
Airline_Air India 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
Airline_GoAir 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Airline_IndiGo 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Airline_Jet Airways 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
Airline_Multiple carriers 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
Airline_Multiple carriers Premium economy 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Airline_SpiceJet 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
Airline_Trujet 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Airline_Vistara 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
Airline_Vistara Premium economy 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Source_Chennai 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Source_Delhi 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0
Source_Kolkata 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0
Source_Mumbai 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Destination_Cochin 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0
Destination_Delhi 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0
Destination_Hyderabad 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Destination_Kolkata 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Additional_Info_Change airports 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Additional_Info_In-flight meal not included 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Additional_Info_No check-in baggage included 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
Additional_Info_No info 1.0 0.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
Additional_Info_Red-eye flight 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Total_Stops 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 2.0
Route 621.0 341.0 621.0 2368.0 1532.0 723.0 723.0 1532.0 1532.0 44.0
Price 3100.0 11087.0 3100.0 9794.0 3257.0 4107.0 4145.0 7229.0 12648.0 11753.0
Journey_day 6.0 12.0 9.0 1.0 21.0 9.0 27.0 27.0 1.0 9.0
Journey_month 6.0 3.0 6.0 5.0 5.0 4.0 4.0 4.0 3.0 5.0
Journey_weekday 3.0 1.0 6.0 2.0 1.0 1.0 5.0 5.0 4.0 3.0
Dep_minutes 1265.0 1235.0 380.0 620.0 355.0 1195.0 1245.0 500.0 690.0 655.0
Arrival_minutes 1345.0 2720.0 460.0 1140.0 515.0 1345.0 1400.0 680.0 850.0 1155.0
Duration_minutes 80.0 1485.0 80.0 520.0 160.0 150.0 155.0 180.0 160.0 500.0

5.1.2. Correlation Heatmap of Features¶

Chart-14. Correlation Heatmap of Features¶

In [44]:
# ===== Select your features wisely to avoid overfitting =====
# ===== Correlation Heatmap visualization code =====

corr = df_encoded.corr(numeric_only=True)
top_features = corr.abs().nlargest(10, 'Price').index
top_corr = df_encoded[top_features].corr()

custom_cmap = sns.color_palette("blend:#FFD700,white,#800000", as_cmap=True)

plt.figure(figsize=(15,6))
sns.heatmap(
    top_corr,
    annot=True,
    fmt=".2f",
    cmap=custom_cmap,
    center=0,
    linewidths=1.5,
    linecolor="lightgrey",
    annot_kws={"size":12, "weight":"bold", "color":"black"},
    cbar_kws={"shrink":0.7, "aspect":30, "label":"Correlation Strength"}
)

plt.title("Top Feature Correlations",
          fontsize=16, fontweight="bold", color="black", pad=20)
plt.xticks(rotation=45, ha="right", fontsize=11, weight="bold", color="#222")
plt.yticks(rotation=0, fontsize=11, weight="bold", color="#222")
plt.grid(False)
plt.tight_layout()
plt.show()

The final dataframe will include only the most influential features, with multicollinearity checked using the Variance Inflation Factor (VIF).

| Feature | Correlation with Price | Type of Relationship | Observation |
|---|---|---|---|
| Total_Stops | +0.67 | Strong positive | More stops significantly increase ticket prices. |
| Duration_minutes | +0.57 | Positive | Longer flight durations tend to have higher prices. |
| Arrival_minutes | +0.46 | Positive | Later arrival times are moderately linked with higher ticket prices. |
| Airline_Jet Airways | +0.45 | Positive | Jet Airways tickets are strongly associated with higher prices. |
| Airline_IndiGo | -0.38 | Negative | IndiGo flights generally have lower ticket prices. |
| Airline_SpiceJet | -0.32 | Negative | SpiceJet flights are associated with cheaper fares. |
| Source_Delhi | +0.32 | Positive | Flights originating from Delhi are moderately linked with higher prices. |
| Destination_Cochin | +0.32 | Positive | Cochin as a destination correlates with higher flight prices. |
| Source_Mumbai | -0.26 | Negative | Mumbai-origin flights tend to have lower prices. |

The strongest predictors for Price are:

  • Positive: Total_Stops, Duration_minutes, Arrival_minutes, Airline_Jet Airways, Source_Delhi, Destination_Cochin

  • Negative: Airline_IndiGo, Airline_SpiceJet, Source_Mumbai

5.1.3. Variance Inflation Factor¶

In [45]:
# ===== Defining a function for variance_inflation_factor =====

def calc_vif(df):
    """
    Calculates Variance Inflation Factor (VIF) for each numerical feature in the dataframe.

    Parameters:
        df (pd.DataFrame): Input dataframe with features

    Returns:
        pd.DataFrame: VIF values sorted in descending order
    """
    # ===== Select only numeric columns =====
    X = df.select_dtypes(include=[np.number])

    # ===== Add constant to the model for intercept =====
    X = add_constant(X)

    # ===== Compute VIF for each feature =====
    vif_data = pd.DataFrame()
    vif_data["Feature"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    # ===== Drop the constant term and sort results =====
    vif_data = vif_data[vif_data["Feature"] != "const"]
    return vif_data.sort_values(by="VIF", ascending=False).reset_index(drop=True)

VIF (Variance Inflation Factor):¶

VIF is calculated after excluding "Price", since it is the target variable.

| VIF Value | Interpretation |
|---|---|
| ≈ 1 | No multicollinearity |
| 1–5 | Moderate multicollinearity (generally okay) |
| > 5 | High multicollinearity (needs investigation) |
| > 10 | Severe multicollinearity (consider removal) |

In [46]:
# ===== Run code =====
df_encoded_vif = df_encoded.drop("Price", axis=1).copy()
vif_result = calc_vif(df_encoded_vif)
print(vif_result)
                                         Feature         VIF
0                                 Source_Kolkata         inf
1                                  Source_Mumbai         inf
2                             Destination_Cochin         inf
3                                   Source_Delhi         inf
4                                 Source_Chennai         inf
5                          Destination_Hyderabad         inf
6                              Destination_Delhi         inf
7                            Destination_Kolkata         inf
8                        Additional_Info_No info  197.207779
9    Additional_Info_In-flight meal not included  176.318554
10                               Arrival_minutes   77.797906
11                              Duration_minutes   52.058923
12  Additional_Info_No check-in baggage included   36.026966
13                                   Dep_minutes   26.214838
14                           Airline_Jet Airways    9.309516
15                                Airline_IndiGo    6.111731
16                             Airline_Air India    5.627517
17                     Airline_Multiple carriers    5.016531
18                              Airline_SpiceJet    3.957277
19                                   Total_Stops    3.721193
20                               Airline_Vistara    2.458753
21                                         Route    2.021528
22               Additional_Info_Change airports    1.673063
23                                 Airline_GoAir    1.613533
24                Additional_Info_Red-eye flight    1.115270
25                                 Journey_month    1.103272
26     Airline_Multiple carriers Premium economy    1.058257
27                                   Journey_day    1.033295
28                               Journey_weekday    1.019019
29               Airline_Vistara Premium economy    1.011878
30                                Airline_Trujet    1.005676

1. Extremely High VIF (very strong multicollinearity)

| Variable | VIF | Observation |
|---|---|---|
| Source_Kolkata | ∞ | Perfect multicollinearity with other Source/Destination features. |
| Source_Mumbai | ∞ | Perfect multicollinearity with other Source/Destination features. |
| Source_Delhi | ∞ | Perfect multicollinearity with other Source/Destination features. |
| Source_Chennai | ∞ | Perfect multicollinearity with other Source/Destination features. |
| Destination_Cochin | ∞ | Perfect multicollinearity with other Destination variables. |
| Destination_Hyderabad | ∞ | Perfect multicollinearity with other Destination variables. |
| Destination_Delhi | ∞ | Perfect multicollinearity with other Destination variables. |
| Destination_Kolkata | ∞ | Perfect multicollinearity with other Destination variables. |
| Additional_Info_No info | 197.21 | Extremely high redundancy; not informative when combined with other features. |
| Additional_Info_In-flight meal not included | 176.32 | Extremely high redundancy; overlaps with other Additional_Info categories. |
| Arrival_minutes | 77.80 | Very strong correlation with Duration and Departure time. |
| Duration_minutes | 52.06 | Multicollinear with Arrival/Departure minutes. |
| Additional_Info_No check-in baggage included | 36.03 | Strong redundancy with other Additional_Info features. |
| Dep_minutes | 26.21 | Multicollinear with Duration and Arrival time. |

2. High Multicollinearity

Variable VIF Observation
Airline_Jet Airways 9.31 High correlation with other airline dummy variables.

3. Moderate Multicollinearity

Variable VIF Observation
Airline_IndiGo 6.11 Some correlation with other airline categories.
Airline_Air India 5.63 Some correlation with other airline categories.
Airline_Multiple carriers 5.02 Some correlation with other airline categories.

4. Low VIF (safe to keep)

Variable VIF Observation
Airline_SpiceJet 3.96 Safe, minor correlation.
Total_Stops 3.72 Safe, slight correlation with Route.
Airline_Vistara 2.46 Safe, low correlation.
Route 2.02 Safe, captures travel path info.
Additional_Info_Change airports 1.67 Safe, independent.
Airline_GoAir 1.61 Safe, independent.
Additional_Info_Red-eye flight 1.12 Safe, independent.
Journey_month 1.10 Safe, independent.
Airline_Multiple carriers Premium economy 1.06 Safe, independent.
Journey_day 1.03 Safe, independent.
Journey_weekday 1.02 Safe, independent.
Airline_Vistara Premium economy 1.01 Safe, independent.
Airline_Trujet 1.01 Safe, independent.

Observations:

  • Source & Destination dummies create perfect multicollinearity (VIF = ∞), since each flight belongs to exactly one of these mutually exclusive categories.

  • Time-related variables (Dep_minutes, Arrival_minutes, Duration_minutes) are highly correlated with one another, which inflates their VIF values.

  • Variables with VIF < 5 show little multicollinearity and are generally safe to keep for modeling.

Based on these observations, the final model uses the following eight features (with the one-hot airline dummies grouped under "Airline"), excluding the target variable `Price`:

| S.No | Feature Name | Reason for Choosing |
| --- | --- | --- |
| 1 | Airline | Categorical feature representing different airlines; fares vary significantly by carrier. |
| 2 | Total_Stops | Number of stops in the journey; stop count is a strong driver of fare differences. |
| 3 | Route | Flight path from source to destination; captures route-specific pricing patterns and stop combinations. |
| 4 | Journey_day | Day of the month when the flight departs; captures date-specific pricing trends. |
| 5 | Journey_weekday | Day of the week; models weekly demand patterns, e.g., weekends vs weekdays. |
| 6 | Journey_month | Month of travel; captures seasonal trends and peak/off-peak pricing. |
| 7 | Arrival_minutes | Arrival time in minutes; affects price via arrival-time convenience. |
| 8 | Duration_minutes | Total flight duration in minutes; longer flights generally cost more. |

5.1.4. Feature selection:¶

In [47]:
# ===== Checking available columns after encoding =====
df_encoded.columns
Out[47]:
Index(['Airline_Air India', 'Airline_GoAir', 'Airline_IndiGo',
       'Airline_Jet Airways', 'Airline_Multiple carriers',
       'Airline_Multiple carriers Premium economy', 'Airline_SpiceJet',
       'Airline_Trujet', 'Airline_Vistara', 'Airline_Vistara Premium economy',
       'Source_Chennai', 'Source_Delhi', 'Source_Kolkata', 'Source_Mumbai',
       'Destination_Cochin', 'Destination_Delhi', 'Destination_Hyderabad',
       'Destination_Kolkata', 'Additional_Info_Change airports',
       'Additional_Info_In-flight meal not included',
       'Additional_Info_No check-in baggage included',
       'Additional_Info_No info', 'Additional_Info_Red-eye flight',
       'Total_Stops', 'Route', 'Price', 'Journey_day', 'Journey_month',
       'Journey_weekday', 'Dep_minutes', 'Arrival_minutes',
       'Duration_minutes'],
      dtype='object')
In [48]:
# ===== Creating final dataframe considering the selected features above =====
# .copy() avoids a SettingWithCopyWarning when we transform columns in place later
final_df = df_encoded[['Airline_Air India', 'Airline_GoAir', 'Airline_IndiGo', 'Airline_Jet Airways', 'Airline_Multiple carriers', 'Airline_Multiple carriers Premium economy',
                       'Airline_SpiceJet', 'Airline_Trujet', 'Airline_Vistara', 'Airline_Vistara Premium economy', 'Route', 'Total_Stops', 'Journey_day', 'Journey_month',
                       'Journey_weekday', 'Arrival_minutes', 'Duration_minutes', 'Price']].copy()

Categorical Features:

  • Airline_Air India

  • Airline_GoAir

  • Airline_IndiGo

  • Airline_Jet Airways

  • Airline_Multiple carriers

  • Airline_Multiple carriers Premium economy

  • Airline_SpiceJet

  • Airline_Trujet

  • Airline_Vistara

  • Airline_Vistara Premium economy

  • Route

  • Total_Stops

  • Journey_day

  • Journey_weekday

  • Journey_month

Numerical Features:

  • Arrival_minutes

  • Duration_minutes

Target Variable:

  • Price
In [49]:
# ===== Check the final dataset =====

final_df.head().T
Out[49]:
0 1 2 3 4
Airline_Air India 0.0 1.0 0.0 0.0 0.0
Airline_GoAir 0.0 0.0 0.0 0.0 0.0
Airline_IndiGo 1.0 0.0 0.0 1.0 1.0
Airline_Jet Airways 0.0 0.0 1.0 0.0 0.0
Airline_Multiple carriers 0.0 0.0 0.0 0.0 0.0
Airline_Multiple carriers Premium economy 0.0 0.0 0.0 0.0 0.0
Airline_SpiceJet 0.0 0.0 0.0 0.0 0.0
Airline_Trujet 0.0 0.0 0.0 0.0 0.0
Airline_Vistara 0.0 0.0 0.0 0.0 0.0
Airline_Vistara Premium economy 0.0 0.0 0.0 0.0 0.0
Route 1532.0 6.0 41.0 9.0 3.0
Total_Stops 0.0 2.0 2.0 1.0 1.0
Journey_day 24.0 1.0 9.0 12.0 1.0
Journey_month 3.0 5.0 6.0 5.0 3.0
Journey_weekday 6.0 2.0 6.0 6.0 4.0
Arrival_minutes 1510.0 795.0 1705.0 1410.0 1295.0
Duration_minutes 170.0 445.0 1140.0 325.0 285.0
Price 3897.0 7662.0 13882.0 6218.0 13302.0

5.2. Data Transformation¶

5.2.1. Identify which features require transformation¶

In [50]:
# ===== Checking which of the variables are continuous in nature =====

for i in final_df.columns:
  print(f"The number of unique counts in feature {i} is: {final_df[i].nunique()}")
The number of unique counts in feature Airline_Air India is: 2
The number of unique counts in feature Airline_GoAir is: 2
The number of unique counts in feature Airline_IndiGo is: 2
The number of unique counts in feature Airline_Jet Airways is: 2
The number of unique counts in feature Airline_Multiple carriers is: 2
The number of unique counts in feature Airline_Multiple carriers Premium economy is: 2
The number of unique counts in feature Airline_SpiceJet is: 2
The number of unique counts in feature Airline_Trujet is: 2
The number of unique counts in feature Airline_Vistara is: 2
The number of unique counts in feature Airline_Vistara Premium economy is: 2
The number of unique counts in feature Route is: 54
The number of unique counts in feature Total_Stops is: 5
The number of unique counts in feature Journey_day is: 10
The number of unique counts in feature Journey_month is: 4
The number of unique counts in feature Journey_weekday is: 7
The number of unique counts in feature Arrival_minutes is: 301
The number of unique counts in feature Duration_minutes is: 343
The number of unique counts in feature Price is: 1805

Applying transformation techniques to the following features:

| Feature | Unique Counts |
| --- | --- |
| Arrival_minutes | 301 |
| Duration_minutes | 343 |
| Price | 1805 |

5.2.2. Evaluate and apply necessary transformations¶

Chart-15. Examining the distribution and Q-Q plots for each continuous variable in our final dataframe¶

In [51]:
# ===== Checking the distribution and Q-Q plot of each continuous variable from our final dataframe =====

# ===== Define continuous features to analyze =====
selected_features = ['Arrival_minutes', 'Duration_minutes', 'Price']

# ===== Check skewness =====
print("Skewness Before Transformation:")
for col in selected_features:
    skew_val = round(final_df[col].skew(), 2)
    print(f"  {col}: {skew_val}")

# ===== Set theme =====
sns.set_style("darkgrid")

# ===== Plot Distribution + Q-Q side by side for each feature =====
for col in selected_features:
    fig, axes = plt.subplots(1, 2, figsize=(14, 4.5))

    # ===== Distribution plot (left) =====
    sns.histplot(final_df[col], kde=True, color='#FFD700', ax=axes[0])
    axes[0].set_title(f'Distribution of {col}')

    # ===== Q-Q plot (right) =====
    stats.probplot(final_df[col], dist="norm", plot=axes[1])
    axes[1].set_title(f'Q-Q Plot of {col}')

    # ===== Overall title for this feature only =====
    fig.suptitle(f"Analysis of {col}", fontsize=16, fontweight="bold", color="black", y=1.02)

    plt.tight_layout()
    plt.show()
Skewness Before Transformation:
  Arrival_minutes: 0.46
  Duration_minutes: 0.81
  Price: 0.45

After analyzing the distributions, the following feature was selected for a square-root transformation:

  • A square-root transformation works well when skewness is moderate (0.5 – 1).

| Feature | Skewness |
| --- | --- |
| Duration_minutes | 0.81 |
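To illustrate why a square-root transform tames moderate right skew, here is a standalone sketch (not notebook code) on synthetic right-skewed durations; the gamma parameters are purely illustrative:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
# Synthetic right-skewed "durations" in minutes (Gamma distribution)
durations = rng.gamma(shape=2.0, scale=150.0, size=5000)

before = skew(durations)
after = skew(np.sqrt(durations))

# The square root compresses the long right tail more than the bulk,
# pulling the distribution toward symmetry
print(f"Skewness before: {before:.2f}")
print(f"Skewness after:  {after:.2f}")
```

The same effect is what drops `Duration_minutes` from 0.81 to 0.34 in the cell below.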

Chart-16. Square root transformation¶

In [52]:
# ===== Applying Square root transformation on the above considered columns =====

# ===== Apply Square Root Transformation =====
final_df['Duration_minutes'] = np.sqrt(final_df['Duration_minutes'])

print("After Applying Square Root Transformation")
print("Skewness:")
print(f" - Duration_minutes: {round(final_df['Duration_minutes'].skew(), 2)}")

# ===== Set theme =====
sns.set_style("darkgrid")

# ===== Create figure with 1 row, 2 columns =====
fig, axes = plt.subplots(1, 2, figsize=(14, 4))

# --- Left: Distribution plot ---
sns.histplot(final_df['Duration_minutes'], kde=True, bins=30, color='#FFD700', ax=axes[0])
axes[0].set_title("Distribution of Duration_minutes (After Square Root Transformation)")

# --- Right: Q-Q plot ---
stats.probplot(final_df['Duration_minutes'], dist="norm", plot=axes[1])
axes[1].set_title("Q-Q Plot of Duration_minutes (After Square Root Transformation)")

# ===== Add overall title for this pair =====
fig.suptitle("Analysis of Duration_minutes", fontsize=16, fontweight="bold", color="black", y=1.02)

plt.tight_layout()
plt.show()
After Applying Square Root Transformation
Skewness:
 - Duration_minutes: 0.34

5.3. Data Scaling - StandardScaler¶

In [53]:
# ===== Applying StandardScaler for Feature Normalization =====

final_scale_df = final_df.copy()

scaler = StandardScaler()
final_scale_df[['Arrival_minutes', 'Duration_minutes']] = scaler.fit_transform(final_scale_df[['Arrival_minutes', 'Duration_minutes']])

Which method have you used to scale your data and why?

To ensure stable model training and faster convergence, the numerical features (`Arrival_minutes`, `Duration_minutes`) were standardized with sklearn's StandardScaler, which rescales each feature to zero mean and unit variance. This prevents variables with larger inherent scales from dominating the model, and it makes linear-model coefficients directly comparable when interpreting each feature's influence.
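As a quick sanity check (a minimal sketch with toy data, not part of the original notebook), StandardScaler's transform z = (x − μ) / σ can be verified directly:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy column with an obvious mean (250) and spread
x = np.array([[100.0], [200.0], [300.0], [400.0]])

scaler = StandardScaler()
z = scaler.fit_transform(x)

# StandardScaler exposes the statistics it fitted
print(scaler.mean_)       # per-feature mean used for centering
print(scaler.scale_)      # per-feature (population) standard deviation
print(z.mean(), z.std())  # scaled column has mean ~0 and std ~1
```

The same `scaler` object is what `fit_transform` uses on `Arrival_minutes` and `Duration_minutes` in the cell above.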

6. Train-Test Split¶

6.1. Data Splitting¶

In [54]:
# ===== Split your data to train and test. Choose Splitting ratio wisely =====
x= final_scale_df.drop(columns='Price',axis=1)
y= final_scale_df[['Price']]

# ===== Spliting data =====
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
In [55]:
# ===== Checking the distribution of classes in training and testing sets =====
# ===== Dataset Split Summary =====
split_summary = pd.DataFrame({
    "Dataset": ["x_train", "x_test", "y_train", "y_test"],
    "Shape": [x_train.shape, x_test.shape, y_train.shape, y_test.shape]
})

print("Dataset Split Summary\n")
print(split_summary.to_string(index=False))
print("-" * 36)

# ===== Target Variable Summary Statistics =====
y_train_stats = y_train.describe()
y_test_stats = y_test.describe()

target_summary = pd.concat([y_train_stats, y_test_stats], axis=1)
target_summary.columns = ["Train Summary", "Test Summary"]

print("\nTarget Variable Summary Statistics\n")
print(target_summary)
Dataset Split Summary

Dataset      Shape
x_train (8233, 17)
 x_test (2059, 17)
y_train  (8233, 1)
 y_test  (2059, 1)
------------------------------------

Target Variable Summary Statistics

       Train Summary  Test Summary
count    8233.000000   2059.000000
mean     8753.105794   9023.986401
std      4072.237063   4018.368215
min      1759.000000   1965.000000
25%      5192.000000   5403.000000
50%      8016.000000   8586.000000
75%     12127.000000  12373.000000
max     23001.000000  22294.000000

What data splitting ratio have you used and why?

  • Train Set - 80%
  • Test Set - 20%

An 80/20 split keeps enough data (8,233 rows) to train the models reliably while reserving a sufficiently large hold-out set (2,059 rows) for an unbiased estimate of generalization performance.

7. Task-2 - ML Model Implementation¶

7.1. Analyze Model¶

In [56]:
# ===== Regression Evaluation Function =====
def analyze_regression_model(model, X_train, y_train, X_test, y_test):
    """
    Evaluate a regression model and visualize results with compact plots,
    including comprehensive metrics and diagnostic charts.
    """
    # ===== Flatten target variables and ensure numeric =====
    y_train = pd.to_numeric(y_train.squeeze(), errors='coerce')
    y_test = pd.to_numeric(y_test.squeeze(), errors='coerce')

    # ===== Train Model =====
    start_time = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start_time

    y_pred_train = model.predict(X_train)
    y_pred = model.predict(X_test)

    # ===== Metrics Calculation =====
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    evs = explained_variance_score(y_test, y_pred)

    # ===== MAPE calculation =====
    try:
        mape = mean_absolute_percentage_error(y_test, y_pred)
    except Exception:
        mape = None

    # ===== Cross-validation scores =====
    try:
        cv_r2 = cross_val_score(model, X_train, y_train, cv=KFold(5, shuffle=True, random_state=42), scoring='r2', n_jobs=-1).mean()
        cv_rmse = -cross_val_score(model, X_train, y_train, cv=KFold(5, shuffle=True, random_state=42), scoring='neg_root_mean_squared_error', n_jobs=-1).mean()
    except Exception:
        cv_r2 = None
        cv_rmse = None

    # ===== Residuals =====
    residuals = y_test - y_pred

    # ===== Metrics dictionary =====
    metrics = {
        "Training R²": round(r2_score(y_train, y_pred_train), 4),
        "Test R²": round(r2, 4),
        "Overfit (Train - Test R²)": round(r2_score(y_train, y_pred_train) - r2, 4),
        "RMSE": round(rmse, 4),
        "MAE": round(mae, 4),
        "MSE": round(mse, 4),
        "Explained Variance": round(evs, 4),
        "Cross-Validation R²": round(cv_r2, 4) if cv_r2 is not None else "N/A",
        "Cross-Validation RMSE": round(cv_rmse, 4) if cv_rmse is not None else "N/A",
        "Training Time (sec)": round(train_time, 3),
        "Samples (Train/Test)": f"{len(X_train)}/{len(X_test)}"
    }
    if mape is not None:
        metrics["MAPE (%)"] = round(mape * 100, 2)

    # ===== Visualization =====
    fig, axes = plt.subplots(3, 2, figsize=(18, 12))
    fig.suptitle(
        f"Regression Model Evaluation: {model.__class__.__name__}\n"
        f"Test R²: {metrics['Test R²']} | CV R²: {metrics['Cross-Validation R²']} | RMSE: {metrics['RMSE']}",
        fontsize=15, weight="bold", color="darkblue"
    )

    # ===== 1. Key Metrics Bar Chart =====
    key_metrics = {k: v for k, v in metrics.items() if k in ["Training R²", "Test R²", "RMSE", "MAE", "Explained Variance"]}
    metrics_df = pd.DataFrame(list(key_metrics.items()), columns=["Metric", "Value"])
    colors = ["orange", "purple", "red", "blue", "green"][:len(metrics_df)]
    bars = axes[0, 0].barh(metrics_df["Metric"], metrics_df["Value"].astype(float), color=colors)
    axes[0, 0].set_title("Key Performance Metrics", fontsize=12, weight="bold")
    x_max = max(metrics_df["Value"].astype(float)) * 1.2
    axes[0, 0].set_xlim(0, x_max)
    for bar in bars:
        width = bar.get_width()
        axes[0, 0].text(width + 0.01, bar.get_y() + bar.get_height()/2, f'{width:.3f}', ha='left', va='center', fontsize=9)

    # ===== 2. Actual vs Predicted Scatter Plot =====
    axes[0, 1].scatter(y_test, y_pred, alpha=0.6, color='blue')
    max_val = max(np.max(y_test), np.max(y_pred))
    min_val = min(np.min(y_test), np.min(y_pred))
    axes[0, 1].plot([min_val, max_val], [min_val, max_val], 'r--', alpha=0.8)
    axes[0, 1].set_xlabel("Actual Values")
    axes[0, 1].set_ylabel("Predicted Values")
    axes[0, 1].set_title("Actual vs Predicted Values", fontsize=12, weight="bold")
    axes[0, 1].text(0.05, 0.95, f'Test R² = {r2:.3f}', transform=axes[0, 1].transAxes, fontsize=12, bbox=dict(boxstyle="round,pad=0.3", facecolor="white"))

    # ===== 3. Residuals Plot =====
    axes[1, 0].scatter(y_pred, residuals, alpha=0.6, color='green')
    axes[1, 0].axhline(y=0, color='red', linestyle='--', alpha=0.8)
    axes[1, 0].set_xlabel("Predicted Values")
    axes[1, 0].set_ylabel("Residuals")
    axes[1, 0].set_title("Residuals vs Predicted Values", fontsize=12, weight="bold")

    # ===== 4. Additional Metrics Table =====
    axes[1, 1].axis('off')
    additional_metrics = {
        "Train R²": metrics["Training R²"],
        "Cross-Val R²": metrics["Cross-Validation R²"],
        "Cross-Val RMSE": metrics["Cross-Validation RMSE"],
        "Overfit (R² diff)": metrics["Overfit (Train - Test R²)"],
        "Train Time": f"{metrics['Training Time (sec)']}s",
        "Samples": metrics["Samples (Train/Test)"]
    }
    if "MAPE (%)" in metrics:
        additional_metrics["MAPE (%)"] = metrics["MAPE (%)"]
    table_data = [[k, v] for k, v in additional_metrics.items()]
    axes[1, 1].set_title("Additional Metrics", fontsize=12, weight="bold", pad=15, color="black")
    table = axes[1, 1].table(cellText=table_data, cellLoc='center', colLabels=["Metric", "Value"], loc='center', bbox=[0.1, 0.3, 0.9, 0.6])
    table.auto_set_font_size(False)
    table.set_fontsize(10)
    table.scale(1, 1.5)
    for (row, col), cell in table.get_celld().items():
        if row == 0:
            cell.set_facecolor("#6A0DAD")
            cell.set_text_props(weight='bold', color="white")
        else:
            if row % 2 == 0:
                cell.set_facecolor("#E6E6FA")
            else:
                cell.set_facecolor("white")

    # ===== 5. Residuals Distribution =====
    axes[2, 0].hist(residuals, bins=30, alpha=0.7, color='orange', edgecolor='black')
    axes[2, 0].axvline(x=0, color='red', linestyle='--', alpha=0.8)
    axes[2, 0].set_xlabel("Residuals")
    axes[2, 0].set_ylabel("Frequency")
    axes[2, 0].set_title("Residuals Distribution", fontsize=12, weight="bold")
    try:
        stat, p_value = stats.normaltest(residuals)
        axes[2, 0].text(0.95, 0.95, f'Normality p-value: {p_value:.3f}', transform=axes[2, 0].transAxes, ha='right', va='top', fontsize=10, bbox=dict(boxstyle="round,pad=0.3", facecolor="white"))
    except Exception:
        pass

    # ===== 6. Error Metrics Comparison =====
    error_metrics = {k: v for k, v in metrics.items() if k in ["RMSE", "MAE", "MSE"]}
    if "MAPE (%)" in metrics:
        error_metrics["MAPE (%)"] = metrics["MAPE (%)"]
    error_df = pd.DataFrame(list(error_metrics.items()), columns=["Metric", "Value"])
    error_df.plot(kind="barh", x="Metric", y="Value", ax=axes[2, 1], color="skyblue", legend=False)
    axes[2, 1].set_title("Error Metrics Comparison", fontsize=12, weight="bold")
    for i, v in enumerate(error_df["Value"]):
        axes[2, 1].text(v + 0.01, i, f'{v:.3f}', va='center')

    plt.tight_layout(rect=[0, 0, 1, 0.95])
    plt.show()

    return metrics

7.1.1. ML Model - 1. Linear Regression¶

Chart-17. Explain the ML Model and its performance using the Evaluation Metric Score Chart¶

In [57]:
# ===== Fitting Linear Regression Model =====
model_lr = LinearRegression()

# ===== Analyzing the model and visualizing evaluation metrics =====
metrics = analyze_regression_model(model_lr, x_train, y_train, x_test, y_test)

print("\nRegression Metrics Summary:")
for k, v in metrics.items():
    print(f"{k}: {v}")
Regression Metrics Summary:
Training R²: 0.648
Test R²: 0.6344
Overfit (Train - Test R²): 0.0136
RMSE: 2429.2217
MAE: 1834.1093
MSE: 5901117.9675
Explained Variance: 0.6351
Cross-Validation R²: 0.6458
Cross-Validation RMSE: 2421.2291
Training Time (sec): 0.025
Samples (Train/Test): 8233/2059
MAPE (%): 22.63

7.1.2. ML Model - 2. Ridge Regression¶

Chart-18. Explain the ML Model and its performance using the Evaluation Metric Score Chart¶

In [58]:
# ===== Fitting Ridge Regression Model =====
model_ridge = Ridge(alpha=1.0, random_state=0)

# ===== Analyzing the model and visualizing evaluation metrics =====
metrics = analyze_regression_model(model_ridge, x_train, y_train, x_test, y_test)

print("\nRegression Metrics Summary:")
for k, v in metrics.items():
    print(f"{k}: {v}")
Regression Metrics Summary:
Training R²: 0.648
Test R²: 0.6342
Overfit (Train - Test R²): 0.0138
RMSE: 2429.7687
MAE: 1834.9187
MSE: 5903775.7912
Explained Variance: 0.635
Cross-Validation R²: 0.6458
Cross-Validation RMSE: 2421.2519
Training Time (sec): 0.027
Samples (Train/Test): 8233/2059
MAPE (%): 22.63

7.1.3. ML Model - 3. Lasso Regression¶

Chart-19. Explain the ML Model and its performance using the Evaluation Metric Score Chart¶

In [59]:
# ===== Fitting Lasso Regression Model =====
model_lasso = Lasso(alpha=0.01, max_iter=10000, random_state=42)

# ===== Analyzing the model and visualizing evaluation metrics =====
metrics = analyze_regression_model(model_lasso, x_train, y_train, x_test, y_test)

print("\nRegression Metrics Summary:")
for k, v in metrics.items():
    print(f"{k}: {v}")
Regression Metrics Summary:
Training R²: 0.648
Test R²: 0.6344
Overfit (Train - Test R²): 0.0136
RMSE: 2429.2435
MAE: 1834.1461
MSE: 5901224.0172
Explained Variance: 0.6351
Cross-Validation R²: 0.6458
Cross-Validation RMSE: 2421.2299
Training Time (sec): 0.042
Samples (Train/Test): 8233/2059
MAPE (%): 22.63

7.1.4. ML Model - 4. Random Forest Regression¶

Chart-20. Explain the ML Model and its performance using the Evaluation Metric Score Chart¶

In [60]:
# ===== Fitting Random Forest Regression Model =====
rf_model = RandomForestRegressor(
    n_estimators=300,     # more trees for stability
    max_depth=18,         # limit depth to avoid overfitting
    min_samples_split=5,  # more samples needed to split → less variance
    min_samples_leaf=2,   # larger leaves → smoother predictions
    bootstrap=True,       # use bootstrapping for diversity
    random_state=1,
    n_jobs=-1
)

# ===== Analyzing the model and visualizing evaluation metrics =====
metrics = analyze_regression_model(rf_model, x_train, y_train, x_test, y_test)

print("\nRegression Metrics Summary:")
for k, v in metrics.items():
    print(f"{k}: {v}")
Regression Metrics Summary:
Training R²: 0.9281
Test R²: 0.817
Overfit (Train - Test R²): 0.1111
RMSE: 1718.6327
MAE: 1129.6432
MSE: 2953698.4997
Explained Variance: 0.8171
Cross-Validation R²: 0.8272
Cross-Validation RMSE: 1690.5666
Training Time (sec): 5.222
Samples (Train/Test): 8233/2059
MAPE (%): 12.46

7.1.5. ML Model - 5. XGBoost Regression¶

Chart-21. Explain the ML Model and its performance using the Evaluation Metric Score Chart¶

In [61]:
# ===== Fitting XGBoost Regression Model =====
xgb_model = XGBRegressor(
    n_estimators=300,       # number of boosting rounds
    max_depth=6,            # tree depth
    learning_rate=0.1,      # step size shrinkage
    subsample=0.8,          # row sampling
    colsample_bytree=0.8,   # feature sampling
    min_child_weight=2,     # similar to min_samples_leaf
    reg_lambda=1.0,         # L2 regularization
    reg_alpha=0.0,          # L1 regularization
    random_state=1,
    n_jobs=-1
)

# ===== Analyzing the model and visualizing evaluation metrics =====
metrics = analyze_regression_model(xgb_model, x_train, y_train, x_test, y_test)

print("\nRegression Metrics Summary:")
for k, v in metrics.items():
    print(f"{k}: {v}")
Regression Metrics Summary:
Training R²: 0.9222
Test R²: 0.8419
Overfit (Train - Test R²): 0.0803
RMSE: 1597.3493
MAE: 1127.7738
MSE: 2551524.7515
Explained Variance: 0.8421
Cross-Validation R²: 0.8436
Cross-Validation RMSE: 1608.8623
Training Time (sec): 0.426
Samples (Train/Test): 8233/2059
MAPE (%): 12.63

7.1.6. ML Model - 6. LightGBM Regression¶

Chart-22. Explain the ML Model and its performance using the Evaluation Metric Score Chart¶

In [62]:
# ===== Fitting LightGBM Regressor Model =====
lgbm_model = LGBMRegressor(
    n_estimators=250,       # boosting iterations
    max_depth=-1,           # no limit (let the tree grow)
    learning_rate=0.05,     # smaller LR → more stable, combine with higher n_estimators
    num_leaves=31,          # controls complexity
    subsample=0.8,          # row sampling
    colsample_bytree=0.8,   # feature sampling
    reg_lambda=1.0,         # L2 regularization
    reg_alpha=0.0,          # L1 regularization
    random_state=1,
    n_jobs=-1
)

# ===== Analyzing the model and visualizing evaluation metrics =====
metrics = analyze_regression_model(lgbm_model, x_train, y_train, x_test, y_test)

print("\nRegression Metrics Summary:")
for k, v in metrics.items():
    print(f"{k}: {v}")
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000527 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 569
[LightGBM] [Info] Number of data points in the train set: 8233, number of used features: 14
[LightGBM] [Info] Start training from score 8753.105794
Regression Metrics Summary:
Training R²: 0.8806
Test R²: 0.8343
Overfit (Train - Test R²): 0.0463
RMSE: 1635.1335
MAE: 1189.8759
MSE: 2673661.4197
Explained Variance: 0.8346
Cross-Validation R²: 0.8395
Cross-Validation RMSE: 1630.136
Training Time (sec): 0.325
Samples (Train/Test): 8233/2059
MAPE (%): 13.59

7.2. Hyperparameter Tuning¶

In [63]:
# ===== Cross-Validation & Hyperparameter Tuning Function =====
def hyperparameter_tune(model_name, model, param_grid, X_train, y_train, X_test, y_test, n_iter=20, cv=3):

    # ===== Flatten target variables and ensure numeric =====
    y_train = pd.to_numeric(y_train.squeeze(), errors='coerce')
    y_test = pd.to_numeric(y_test.squeeze(), errors='coerce')

    # Check for NaN values after conversion
    if y_train.isna().any() or y_test.isna().any():
        print("Warning: NaN values found in target variables after conversion")
        y_train = y_train.dropna()
        y_test = y_test.dropna()
        # Also filter corresponding X data
        X_train = X_train.loc[y_train.index]
        X_test = X_test.loc[y_test.index]

    # ===== Hyperparameter tuning =====
    start_time = time.time()
    search = RandomizedSearchCV(
        estimator=model,
        param_distributions=param_grid,
        n_iter=n_iter,
        scoring='r2',
        cv=cv,
        n_jobs=-1,
        verbose=2,
        random_state=42
    )
    search.fit(X_train, y_train)

    best_params = search.best_params_
    best_model = search.best_estimator_  # RandomizedSearchCV already refits the best model on the full training set
    train_time = time.time() - start_time

    # ===== Predictions with best model =====
    y_pred_train = best_model.predict(X_train)
    y_pred = best_model.predict(X_test)

    # ===== Metrics Calculation =====
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    evs = explained_variance_score(y_test, y_pred)

    # ===== MAPE calculation =====
    try:
        mape = mean_absolute_percentage_error(y_test, y_pred)
    except Exception:
        mape = None

    # ===== Cross-validation scores =====
    try:
        cv_r2 = cross_val_score(best_model, X_train, y_train, cv=KFold(5, shuffle=True, random_state=42), scoring='r2', n_jobs=-1).mean()
        cv_rmse = -cross_val_score(best_model, X_train, y_train, cv=KFold(5, shuffle=True, random_state=42), scoring='neg_root_mean_squared_error', n_jobs=-1).mean()
    except Exception:
        cv_r2 = None
        cv_rmse = None

    # ===== Residuals =====
    residuals = y_test - y_pred

    # ===== Metrics dictionary =====
    metrics = {
        "Training R²": round(r2_score(y_train, y_pred_train), 4),
        "Test R²": round(r2, 4),
        "Overfit (Train - Test R²)": round(r2_score(y_train, y_pred_train) - r2, 4),
        "RMSE": round(rmse, 4),
        "MAE": round(mae, 4),
        "MSE": round(mse, 4),
        "Explained Variance": round(evs, 4),
        "Cross-Validation R²": round(cv_r2, 4) if cv_r2 is not None else "N/A",
        "Cross-Validation RMSE": round(cv_rmse, 4) if cv_rmse is not None else "N/A",
        "Training Time (sec)": round(train_time, 3),
        "Samples (Train/Test)": f"{len(X_train)}/{len(X_test)}",
        "Best Parameters": best_params
    }
    if mape is not None:
        metrics["MAPE (%)"] = round(mape * 100, 2)

    # ===== Visualization =====
    fig, axes = plt.subplots(3, 2, figsize=(18, 12))
    fig.suptitle(
        f"Hyperparameter-Tuned Model Evaluation: {model.__class__.__name__}\n"
        f"Test R²: {metrics['Test R²']} | CV R²: {metrics['Cross-Validation R²']} | RMSE: {metrics['RMSE']}",
        fontsize=15, weight="bold", color="darkblue"
    )

    # ===== 1. Key Metrics Bar Chart =====
    key_metrics = {k: v for k, v in metrics.items() if k in ["Training R²", "Test R²", "RMSE", "MAE", "Explained Variance"]}
    metrics_df = pd.DataFrame(list(key_metrics.items()), columns=["Metric", "Value"])
    # Filter out non-numeric values
    metrics_df = metrics_df[metrics_df["Value"].apply(lambda x: isinstance(x, (int, float)))]
    if not metrics_df.empty:
        colors = ["red", "blue", "green", "orange", "purple"][:len(metrics_df)]
        bars = axes[0, 0].barh(metrics_df["Metric"], metrics_df["Value"].astype(float), color=colors)
        axes[0, 0].set_title("Key Performance Metrics", fontsize=12, weight="bold")
        x_max = max(metrics_df["Value"].astype(float)) * 1.2
        axes[0, 0].set_xlim(0, x_max)
        for bar in bars:
            width = bar.get_width()
            axes[0, 0].text(width + 0.01, bar.get_y() + bar.get_height()/2, f'{width:.3f}', ha='left', va='center', fontsize=9)
    else:
        axes[0, 0].text(0.5, 0.5, "No numeric metrics available", ha='center', va='center')
        axes[0, 0].set_title("Key Performance Metrics", fontsize=12, weight="bold")

    # ===== 2. Actual vs Predicted Scatter Plot =====
    axes[0, 1].scatter(y_test, y_pred, alpha=0.6, color='blue')
    max_val = max(np.max(y_test), np.max(y_pred))
    min_val = min(np.min(y_test), np.min(y_pred))
    axes[0, 1].plot([min_val, max_val], [min_val, max_val], 'r--', alpha=0.8)
    axes[0, 1].set_xlabel("Actual Values")
    axes[0, 1].set_ylabel("Predicted Values")
    axes[0, 1].set_title("Actual vs Predicted Values", fontsize=12, weight="bold")
    axes[0, 1].text(0.05, 0.95, f'Test R² = {r2:.3f}', transform=axes[0, 1].transAxes, fontsize=12, bbox=dict(boxstyle="round,pad=0.3", facecolor="white"))

    # ===== 3. Residuals Plot =====
    axes[1, 0].scatter(y_pred, residuals, alpha=0.6, color='green')
    axes[1, 0].axhline(y=0, color='red', linestyle='--', alpha=0.8)
    axes[1, 0].set_xlabel("Predicted Values")
    axes[1, 0].set_ylabel("Residuals")
    axes[1, 0].set_title("Residuals vs Predicted Values", fontsize=12, weight="bold")

    # ===== 4. Additional Metrics Table =====
    axes[1, 1].axis('off')
    additional_metrics = {
        "Train R²": metrics["Training R²"],
        "Cross-Val R²": metrics["Cross-Validation R²"],
        "Cross-Val RMSE": metrics["Cross-Validation RMSE"],
        "Overfit (R² diff)": metrics["Overfit (Train - Test R²)"],
        "Train Time": f"{metrics['Training Time (sec)']}s",
        "Samples": metrics["Samples (Train/Test)"]
    }
    if "MAPE (%)" in metrics:
        additional_metrics["MAPE (%)"] = metrics["MAPE (%)"]
    table_data = [[k, v] for k, v in additional_metrics.items()]
    axes[1, 1].set_title("Additional Metrics", fontsize=12, weight="bold", pad=15, color="black")
    table = axes[1, 1].table(cellText=table_data, cellLoc='center', colLabels=["Metric", "Value"], loc='center', bbox=[0.1, 0.3, 0.9, 0.6])
    table.auto_set_font_size(False)
    table.set_fontsize(10)
    table.scale(1, 1.5)
    for (row, col), cell in table.get_celld().items():
        if row == 0:
            cell.set_facecolor("#6A0DAD")
            cell.set_text_props(weight='bold', color="white")
        else:
            if row % 2 == 0:
                cell.set_facecolor("#E6E6FA")
            else:
                cell.set_facecolor("white")

    # ===== 5. Residuals Distribution =====
    axes[2, 0].hist(residuals, bins=30, alpha=0.7, color='navy', edgecolor='black')
    axes[2, 0].axvline(x=0, color='red', linestyle='--', alpha=0.8)
    axes[2, 0].set_xlabel("Residuals")
    axes[2, 0].set_ylabel("Frequency")
    axes[2, 0].set_title("Residuals Distribution", fontsize=12, weight="bold")
    try:
        stat, p_value = stats.normaltest(residuals)
        axes[2, 0].text(0.95, 0.95, f'Normality p-value: {p_value:.3f}', transform=axes[2, 0].transAxes, ha='right', va='top', fontsize=10, bbox=dict(boxstyle="round,pad=0.3", facecolor="white"))
    except Exception:
        pass  # normaltest needs ≥ 8 samples; skip the annotation if it fails

    # ===== 6. Error Metrics Comparison =====
    error_metrics = {k: v for k, v in metrics.items() if k in ["RMSE", "MAE", "MSE"]}
    if "MAPE (%)" in metrics:
        error_metrics["MAPE (%)"] = metrics["MAPE (%)"]
    # Filter out non-numeric values
    error_metrics = {k: v for k, v in error_metrics.items() if isinstance(v, (int, float))}
    if error_metrics:
        error_df = pd.DataFrame(list(error_metrics.items()), columns=["Metric", "Value"])
        error_df.plot(kind="barh", x="Metric", y="Value", ax=axes[2, 1], color="red", legend=False)
        axes[2, 1].set_title("Error Metrics Comparison", fontsize=12, weight="bold")
        for i, v in enumerate(error_df["Value"]):
            axes[2, 1].text(v + 0.01, i, f'{v:.3f}', va='center')
    else:
        axes[2, 1].text(0.5, 0.5, "No numeric error metrics available", ha='center', va='center')
        axes[2, 1].set_title("Error Metrics Comparison", fontsize=12, weight="bold")

    plt.tight_layout(rect=[0, 0, 1, 0.95])
    plt.show()

    return best_model, best_params, metrics

The hyperparameter tuning for RandomForest, XGBoost, and LightGBM reflects strategic adjustments to optimize each model for flight price prediction. RandomForest's grid explores tree depth, leaf sizes, and feature sampling to balance variance reduction against overfitting. XGBoost's grid pairs conservative learning rates with L1/L2 regularization and row/column subsampling so the boosted trees generalize rather than memorize. LightGBM's grid varies the number of leaves, depth, and minimum child samples to control complexity while keeping training fast. Together, these choices aim to improve each model's accuracy, robustness, and generalization on the skewed, mixed-type features typical of airline fare data.
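For reference, the randomized-search pattern assumed by the `hyperparameter_tune` helper can be sketched with scikit-learn alone. The dataset and grid below are synthetic stand-ins, not the project's data or final grids:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the flight-price feature matrix
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Small illustrative grid; the notebook's grids are larger
param_grid = {"n_estimators": [100, 200], "max_depth": [None, 10]}

# Sample n_iter candidates from the grid, score each with 3-fold CV
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0, n_jobs=-1),
    param_distributions=param_grid,
    n_iter=3, cv=3, scoring="r2", random_state=0, n_jobs=-1,
)
search.fit(x_tr, y_tr)

# Best estimator is refit on the full training split, then evaluated on the hold-out
best_model = search.best_estimator_
test_r2 = r2_score(y_te, best_model.predict(x_te))
print(search.best_params_, round(test_r2, 3))
```

The helper in this notebook additionally computes RMSE/MAE/MAPE, overfit gap, and timing, and renders the evaluation charts, but the search-fit-refit-score core is the same.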

7.2.1. Hyperparameter Tuning - 1. RandomForest Regressor¶

Chart-23. Explain the ML Model and its performance using Evaluation Metric Score Chart¶

In [64]:
# ===== Fitting RandomForestRegressor Model =====
model_rf_hpt = RandomForestRegressor(
    n_estimators=500,        # start with higher number of trees
    max_depth=None,          # let trees grow fully
    min_samples_split=2,     # minimal split
    min_samples_leaf=1,      # minimal leaf
    max_features='sqrt',     # common choice
    random_state=6,
    n_jobs=-1
)

# ===== Hyperparameter grid =====
rf_param_grid = {
    'n_estimators': [200, 500, 800],       # try more trees
    'max_depth': [10, 20, 30, None],       # deeper trees
    'min_samples_split': [2, 5, 10],       # regularization
    'min_samples_leaf': [1, 2, 4, 8],      # smoother predictions
    'max_features': ['sqrt', 0.8],         # feature selection
    'bootstrap': [True, False]             # sampling method
}

# ===== Analysing the model and Visualizing evaluation Metric Score chart =====
best_rf_model, best_params, metrics = hyperparameter_tune("RandomForestRegressor", model_rf_hpt, rf_param_grid, x_train, y_train, x_test, y_test, n_iter=5, cv=3)

print("\nHyperparameter-Tuned Model Metrics Summary:")
for k, v in metrics.items():
    print(f"{k}: {v}")
Fitting 3 folds for each of 5 candidates, totalling 15 fits
Hyperparameter-Tuned Model Metrics Summary:
Training R²: 0.9015
Test R²: 0.8223
Overfit (Train - Test R²): 0.0792
RMSE: 1693.697
MAE: 1141.7828
MSE: 2868609.6685
Explained Variance: 0.8224
Cross-Validation R²: 0.8308
Cross-Validation RMSE: 1673.4133
Training Time (sec): 44.216
Samples (Train/Test): 8233/2059
Best Parameters: {'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 0.8, 'max_depth': None, 'bootstrap': True}
MAPE (%): 12.62

7.2.2. Hyperparameter Tuning - 2. XG Boost Regressor¶

Chart-24. Explain the ML Model and its performance using Evaluation Metric Score Chart¶

In [65]:
# ===== Fitting XGBRegressor Model =====
model_xgb_hpt = XGBRegressor(
    objective='reg:squarederror',   # regression objective
    eval_metric='rmse',             # RMSE as metric
    random_state=8,
    n_jobs=-1,
    tree_method="hist"          # faster training
)

# ===== Hyperparameter Grid =====
xgb_param_grid = {
    'n_estimators': [500, 800, 1000],        # more trees for stability
    'learning_rate': [0.01, 0.05],           # slower learning for better generalization
    'max_depth': [6, 8],                     # moderate depth (avoids shallow underfit)
    'min_child_weight': [1, 3, 5],           # controls leaf size → helps reduce overfitting
    'subsample': [0.8, 0.9],                 # row sampling (regularization)
    'colsample_bytree': [0.8, 0.9],          # feature sampling
    'gamma': [0, 0.1],                       # min loss reduction
    'reg_alpha': [0, 0.01, 0.1],             # L1 regularization
    'reg_lambda': [1, 2]                     # L2 regularization
}


# ===== Analysing the model and Visualizing evaluation Metric Score chart =====
best_xgb_model, best_params, metrics = hyperparameter_tune("XGBRegressor", model_xgb_hpt, xgb_param_grid, x_train, y_train, x_test, y_test, n_iter=5, cv=3)

print("\nHyperparameter-Tuned Model Metrics Summary:")
for k, v in metrics.items():
    print(f"{k}: {v}")
Fitting 3 folds for each of 5 candidates, totalling 15 fits
Hyperparameter-Tuned Model Metrics Summary:
Training R²: 0.9162
Test R²: 0.8437
Overfit (Train - Test R²): 0.0726
RMSE: 1588.3808
MAE: 1128.9846
MSE: 2522953.4853
Explained Variance: 0.8439
Cross-Validation R²: 0.8465
Cross-Validation RMSE: 1594.2372
Training Time (sec): 20.134
Samples (Train/Test): 8233/2059
Best Parameters: {'subsample': 0.8, 'reg_lambda': 2, 'reg_alpha': 0.1, 'n_estimators': 500, 'min_child_weight': 3, 'max_depth': 6, 'learning_rate': 0.05, 'gamma': 0, 'colsample_bytree': 0.9}
MAPE (%): 12.7

7.2.3. Hyperparameter Tuning - 3. LightGBM Regression¶

Chart-25. Explain the ML Model and its performance using Evaluation Metric Score Chart¶

In [66]:
# ===== Fitting LightGBM Regressor Model =====
lgbm_model_hpt = LGBMRegressor(
    n_estimators=250,       # boosting iterations
    max_depth=-1,           # no limit (let the tree grow)
    learning_rate=0.05,     # smaller LR → more stable, combine with higher n_estimators
    num_leaves=31,          # controls complexity
    subsample=0.8,          # row sampling
    colsample_bytree=0.8,   # feature sampling
    reg_lambda=1.0,         # L2 regularization
    reg_alpha=0.0,          # L1 regularization
    random_state=1,
    n_jobs=-1
)

# ===== Hyperparameter grid =====
lgbm_param_grid = {
    'n_estimators': [200, 400, 600],       # boosting rounds
    'learning_rate': [0.01, 0.05, 0.1],    # step size shrinkage
    'max_depth': [-1, 6, 10, 15],          # tree depth
    'num_leaves': [31, 63, 127],           # larger → more complex model
    'min_child_samples': [10, 20, 50],     # minimum samples per leaf
    'subsample': [0.7, 0.8, 1.0],          # row sampling
    'colsample_bytree': [0.7, 0.8, 1.0],   # feature sampling
    'reg_alpha': [0, 0.1, 1.0],            # L1 regularization
    'reg_lambda': [0.5, 1.0, 2.0],         # L2 regularization
}

# ===== Analysing the model and Visualizing evaluation Metric Score chart =====
best_lgbm_model, best_params, metrics = hyperparameter_tune("LGBMRegressor", lgbm_model_hpt, lgbm_param_grid, x_train, y_train, x_test, y_test, n_iter=5, cv=3)

print("\nHyperparameter-Tuned Model Metrics Summary:")
for k, v in metrics.items():
    print(f"{k}: {v}")
Fitting 3 folds for each of 5 candidates, totalling 15 fits
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000213 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 569
[LightGBM] [Info] Number of data points in the train set: 8233, number of used features: 14
[LightGBM] [Info] Start training from score 8753.105794
Hyperparameter-Tuned Model Metrics Summary:
Training R²: 0.9192
Test R²: 0.846
Overfit (Train - Test R²): 0.0732
RMSE: 1576.467
MAE: 1105.8847
MSE: 2485248.1513
Explained Variance: 0.8462
Cross-Validation R²: 0.846
Cross-Validation RMSE: 1596.4681
Training Time (sec): 17.563
Samples (Train/Test): 8233/2059
Best Parameters: {'subsample': 1.0, 'reg_lambda': 0.5, 'reg_alpha': 0, 'num_leaves': 31, 'n_estimators': 400, 'min_child_samples': 20, 'max_depth': -1, 'learning_rate': 0.1, 'colsample_bytree': 1.0}
MAPE (%): 12.26
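All three tuned ensembles expose a `feature_importances_` attribute, which is a natural next check on which inputs drive predicted fares. A minimal sketch with scikit-learn's RandomForestRegressor on synthetic data (the same attribute exists on the fitted XGBoost and LightGBM estimators; in the notebook, the tuned model and `x_train` columns would replace these stand-ins):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the flight-price features
X, y = make_regression(n_samples=300, n_features=6, random_state=0)
cols = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort to rank drivers of the target
importances = pd.Series(model.feature_importances_, index=cols).sort_values(ascending=False)
print(importances.head())
```

Ranking the real features this way would let the EDA claim (airline, stops, and duration dominate price) be verified against the tuned models.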

8. Model Evaluation¶

8.1. ML Model Comparison & Interpretation¶

8.1.1. Model Comparison:¶

In [67]:
# ===== Store results =====
results = {
    "Linear Regression": {
        'Training R²': 0.648,
        'Test R²': 0.6344,
        'Overfit (Train - Test R²)': 0.0136,
        'RMSE': 2429.2217,
        'MAE': 1834.1093,
        'MSE': 5901117.9675,
        'Explained Variance': 0.6351,
        'Cross-Validation R²': 0.6458,
        'Cross-Validation RMSE': 2421.2291,
        'Training Time (sec)': 0.024,
        'Samples (Train/Test)': (8233, 2059),
        'MAPE (%)': 22.63
    },
    "Ridge Regression": {
        'Training R²': 0.648,
        'Test R²': 0.6342,
        'Overfit (Train - Test R²)': 0.0138,
        'RMSE': 2429.7687,
        'MAE': 1834.9187,
        'MSE': 5903775.7912,
        'Explained Variance': 0.635,
        'Cross-Validation R²': 0.6458,
        'Cross-Validation RMSE': 2421.2519,
        'Training Time (sec)': 0.087,
        'Samples (Train/Test)': (8233, 2059),
        'MAPE (%)': 22.63
    },
    "Lasso Regression": {
        'Training R²': 0.648,
        'Test R²': 0.6344,
        'Overfit (Train - Test R²)': 0.0136,
        'RMSE': 2429.2435,
        'MAE': 1834.1461,
        'MSE': 5901224.0172,
        'Explained Variance': 0.6351,
        'Cross-Validation R²': 0.6458,
        'Cross-Validation RMSE': 2421.2299,
        'Training Time (sec)': 0.045,
        'Samples (Train/Test)': (8233, 2059),
        'MAPE (%)': 22.63
    },
    "Random Forest Regression": {
        'Training R²': 0.9281,
        'Test R²': 0.817,
        'Overfit (Train - Test R²)': 0.1111,
        'RMSE': 1718.6327,
        'MAE': 1129.6432,
        'MSE': 2953698.4997,
        'Explained Variance': 0.8171,
        'Cross-Validation R²': 0.8272,
        'Cross-Validation RMSE': 1690.5666,
        'Training Time (sec)': 7.1,
        'Samples (Train/Test)': (8233, 2059),
        'MAPE (%)': 12.46
    },
    "XGBoost Regression": {
        'Training R²': 0.9222,
        'Test R²': 0.8419,
        'Overfit (Train - Test R²)': 0.0803,
        'RMSE': 1597.3493,
        'MAE': 1127.7738,
        'MSE': 2551524.7515,
        'Explained Variance': 0.8421,
        'Cross-Validation R²': 0.8436,
        'Cross-Validation RMSE': 1608.8623,
        'Training Time (sec)': 0.52,
        'Samples (Train/Test)': (8233, 2059),
        'MAPE (%)': 12.63
    },
    "LightGBM Regression": {
        'Training R²': 0.8806,
        'Test R²': 0.8343,
        'Overfit (Train - Test R²)': 0.0463,
        'RMSE': 1635.1335,
        'MAE': 1189.8759,
        'MSE': 2673661.4197,
        'Explained Variance': 0.8346,
        'Cross-Validation R²': 0.8395,
        'Cross-Validation RMSE': 1630.136,
        'Training Time (sec)': 0.372,
        'Samples (Train/Test)': (8233, 2059),
        'MAPE (%)': 13.59
    }
}

# ===== Convert to DataFrame =====
df_results = pd.DataFrame(results).T
print("\n=== Model Comparison Table ===")
df_results
=== Model Comparison Table ===
Out[67]:
Training R² Test R² Overfit (Train - Test R²) RMSE MAE MSE Explained Variance Cross-Validation R² Cross-Validation RMSE Training Time (sec) Samples (Train/Test) MAPE (%)
Linear Regression 0.648 0.6344 0.0136 2429.2217 1834.1093 5901117.9675 0.6351 0.6458 2421.2291 0.024 (8233, 2059) 22.63
Ridge Regression 0.648 0.6342 0.0138 2429.7687 1834.9187 5903775.7912 0.635 0.6458 2421.2519 0.087 (8233, 2059) 22.63
Lasso Regression 0.648 0.6344 0.0136 2429.2435 1834.1461 5901224.0172 0.6351 0.6458 2421.2299 0.045 (8233, 2059) 22.63
Random Forest Regression 0.9281 0.817 0.1111 1718.6327 1129.6432 2953698.4997 0.8171 0.8272 1690.5666 7.1 (8233, 2059) 12.46
XGBoost Regression 0.9222 0.8419 0.0803 1597.3493 1127.7738 2551524.7515 0.8421 0.8436 1608.8623 0.52 (8233, 2059) 12.63
LightGBM Regression 0.8806 0.8343 0.0463 1635.1335 1189.8759 2673661.4197 0.8346 0.8395 1630.136 0.372 (8233, 2059) 13.59

8.1.2. ML Model Plot Comparison¶

Chart-26. Evaluating and Comparing Model Performance Scores¶

In [68]:
# ===== Comparing Model Performance Scores =====
def add_labels(ax, decimals=3, threshold=0.05):
    """Add labels to bar chart with proper positioning."""
    y_lim = ax.get_ylim()[1]
    for p in ax.patches:
        value = p.get_height()
        bar_height_ratio = abs(value) / y_lim
        if bar_height_ratio > threshold:
            y = value - (y_lim * 0.02)
            va = 'top'
        else:
            y = value + (y_lim * 0.01)
            va = 'bottom'
        ax.text(
            p.get_x() + p.get_width() / 2., y,
            f"{value:.{decimals}f}",
            ha='center', va=va, fontsize=9,
            color="black", fontweight="bold"
        )

# ===== 1. Performance Metrics =====
performance_metrics = ["Training R²", "Test R²", "Overfit (Train - Test R²)", "Explained Variance"]
plot_perf = df_results[performance_metrics]
ax1 = plot_perf.plot(kind='bar', figsize=(20, 4), width=0.8, colormap="Blues")
plt.title("Performance Metrics", fontsize=16, fontweight='bold')
plt.ylabel("Score", fontsize=12)
plt.xticks(rotation=0, fontsize=11)
plt.legend(title="Metrics", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(axis='y', linestyle='--', alpha=0.7)
add_labels(ax1, decimals=3)
plt.tight_layout()
plt.show()

# ===== 2. Error Metrics =====
error_metrics = ["RMSE", "MAE"]
plot_error = df_results[error_metrics]
ax2 = plot_error.plot(kind='bar', figsize=(20, 4), width=0.6, colormap="Reds")
plt.title("Error Metrics", fontsize=16, fontweight='bold')
plt.ylabel("Error Value", fontsize=12)
plt.xticks(rotation=0, fontsize=11)
plt.legend(title="Metrics", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(axis='y', linestyle='--', alpha=0.7)
add_labels(ax2, decimals=2)
plt.tight_layout()
plt.show()

# ===== 3. Percentage Error =====
percent_metrics = ["MAPE (%)"]
plot_percent = df_results[percent_metrics]
ax3 = plot_percent.plot(kind='bar', figsize=(20, 4), width=0.4, colormap="viridis")
plt.title("Percentage Error", fontsize=16, fontweight='bold')
plt.ylabel("Percentage (%)", fontsize=12)
plt.xticks(rotation=0, fontsize=11)
plt.legend(title="Metrics", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(axis='y', linestyle='--', alpha=0.7)
add_labels(ax3, decimals=2)
plt.tight_layout()
plt.show()

# ===== 4. Training Time =====
time_metrics = ["Training Time (sec)"]
plot_time = df_results[time_metrics]
ax4 = plot_time.plot(kind='bar', figsize=(20, 4), width=0.4, colormap="Wistia")
plt.title("Training Time", fontsize=16, fontweight='bold')
plt.ylabel("Seconds", fontsize=12)
plt.xticks(rotation=0, fontsize=11)
plt.legend(title="Metrics", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(axis='y', linestyle='--', alpha=0.7)
add_labels(ax4, decimals=3)
plt.tight_layout()
plt.show()

Insights:¶

  • Linear, Ridge, and Lasso Regression perform nearly the same with Training R² ≈ 0.648 and Test R² ≈ 0.634, indicating underfitting.

  • These linear models also have the highest errors (RMSE ≈ 2430, MAE ≈ 1834, MAPE ≈ 22.6%), making them unsuitable.

  • Random Forest Regression achieves the highest Training R² (0.928) but drops to Test R² = 0.817, showing overfitting (gap = 0.111).

  • Random Forest reduces error significantly compared to linear models (RMSE ≈ 1719, MAE ≈ 1130, MAPE ≈ 12.5%).

  • XGBoost Regression shows the best Test R² (0.842) with Training R² = 0.922, striking a good balance between accuracy and overfitting.

  • XGBoost also gives the lowest errors (RMSE ≈ 1597, MAE ≈ 1128, MAPE ≈ 12.6%), making it the top-performing model overall.

  • LightGBM Regression performs slightly below XGBoost with Test R² = 0.834, but with less overfitting (train–test gap = 0.046).

  • LightGBM maintains competitive error rates (RMSE ≈ 1635, MAE ≈ 1190, MAPE ≈ 13.6%), showing more stable generalization.

  • Explained Variance aligns closely with Test R² across all models, confirming the reliability of boosting models (XGBoost & LightGBM).

  • Overall, XGBoost is the best choice for maximum accuracy, while LightGBM is the best choice for balanced performance and reduced overfitting. Linear models are weak, and Random Forest, while strong, tends to overfit.
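The ranking above can also be derived programmatically from the comparison table; a small sketch using the Test R² and RMSE columns reported earlier (values copied from `df_results`):

```python
import pandas as pd

# Minimal stand-in for the df_results comparison frame built above
df_results = pd.DataFrame({
    "Test R²": [0.6344, 0.6342, 0.6344, 0.8170, 0.8419, 0.8343],
    "RMSE": [2429.2217, 2429.7687, 2429.2435, 1718.6327, 1597.3493, 1635.1335],
}, index=["Linear Regression", "Ridge Regression", "Lasso Regression",
          "Random Forest Regression", "XGBoost Regression", "LightGBM Regression"])

# Best model = highest Test R² and lowest RMSE
best_by_r2 = df_results["Test R²"].astype(float).idxmax()
best_by_rmse = df_results["RMSE"].astype(float).idxmin()
print(best_by_r2, best_by_rmse)  # both point to XGBoost Regression
```

Both criteria agree with the narrative conclusion: XGBoost leads on the untuned models, with LightGBM close behind.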

8.1.3. Comparing Model Accuracy Scores¶

Chart-27. Evaluating and Comparing Model Accuracy Scores¶

In [69]:
# ===== Comparing Model Accuracy Scores =====
def add_value_labels(ax, decimals=3):
    """Attach value labels inside horizontal bars with auto text color."""
    for p in ax.patches:
        value = p.get_width()
        x = value - (ax.get_xlim()[1] * 0.01)
        ha, va = 'right', 'center'
        color = "white" if value > 0.15 else "black"
        txt = ax.text(
            x, p.get_y() + p.get_height() / 2.,
            f"{value:.{decimals}f}",
            va=va, ha=ha, fontsize=9,
            color=color, fontweight="bold"
        )
        txt.set_path_effects([
            path_effects.Stroke(linewidth=2, foreground='black'),
            path_effects.Normal()
        ])

# ===== Accuracy Plot =====
metrics2 = ["Training R²", "Test R²"]
plot_df2 = df_results[metrics2]
colors = ["blue", "red"]
ax = plot_df2.plot(
    kind='barh', figsize=(9, 6), width=0.6,
    color=colors, edgecolor="black"
)
plt.title("Model Accuracy", fontsize=16, fontweight='bold', color="black")
plt.xlabel("Accuracy Score", fontsize=12)
plt.yticks(fontsize=11, fontweight="bold")
plt.grid(axis='x', linestyle='--', alpha=0.7)
add_value_labels(ax, decimals=3)
plt.tight_layout()
plt.show()

Observation: Model Accuracy Comparison¶

  • Linear, Ridge, and Lasso Regression show almost identical performance (Train ≈ 0.648, Test ≈ 0.634), indicating underfitting.

  • Random Forest Regression achieves very high Training R² (0.928) but drops to 0.817 on Test, showing overfitting.

  • XGBoost Regression gives the highest Test R² (0.842), making it the most accurate model overall.

  • LightGBM Regression achieves a Test R² of 0.834, slightly below XGBoost but with better generalization (smaller train–test gap).

  • Both boosting models (XGBoost & LightGBM) outperform Random Forest and linear models in predictive accuracy.

  • Explained Variance aligns closely with Test R², reinforcing the reliability of XGBoost and LightGBM results.

8.2. Hyperparameter-Tuning Comparison & Interpretation¶

8.2.1. Hyperparameter-Tuning Comparison:¶

In [70]:
# ===== Store results =====
results_2 = {
    "Random Forest Regressor": {
        'Training R²': 0.9015,
        'Test R²': 0.8223,
        'Overfit (Train - Test R²)': 0.0792,
        'RMSE': 1693.697,
        'MAE': 1141.7828,
        'MSE': 2868609.6685,
        'Explained Variance': 0.8224,
        'Cross-Validation R²': 0.8308,
        'Cross-Validation RMSE': 1673.4133,
        'Training Time (sec)': 54.919,
        'Samples (Train/Test)': "8233/2059",
        'MAPE (%)': 12.62
    },

    "XGBoost Regressor": {
        'Training R²': 0.9162,
        'Test R²': 0.8437,
        'Overfit (Train - Test R²)': 0.0726,
        'RMSE': 1588.3808,
        'MAE': 1128.9846,
        'MSE': 2522953.4853,
        'Explained Variance': 0.8439,
        'Cross-Validation R²': 0.8465,
        'Cross-Validation RMSE': 1594.2372,
        'Training Time (sec)': 31.581,
        'Samples (Train/Test)': "8233/2059",
        'MAPE (%)': 12.70
    },

    "LightGBM Regressor": {
        'Training R²': 0.9192,
        'Test R²': 0.8460,
        'Overfit (Train - Test R²)': 0.0732,
        'RMSE': 1576.467,
        'MAE': 1105.8847,
        'MSE': 2485248.1513,
        'Explained Variance': 0.8462,
        'Cross-Validation R²': 0.8460,
        'Cross-Validation RMSE': 1596.4681,
        'Training Time (sec)': 72.076,
        'Samples (Train/Test)': "8233/2059",
        'MAPE (%)': 12.26
    }
}

# ===== Convert to DataFrame =====
df_results_2 = pd.DataFrame(results_2).T
print("\n=== Hyperparameter-Tuning Comparison Table ===")
df_results_2
=== Hyperparameter-Tuning Comparison Table ===
Out[70]:
Training R² Test R² Overfit (Train - Test R²) RMSE MAE MSE Explained Variance Cross-Validation R² Cross-Validation RMSE Training Time (sec) Samples (Train/Test) MAPE (%)
Random Forest Regressor 0.9015 0.8223 0.0792 1693.697 1141.7828 2868609.6685 0.8224 0.8308 1673.4133 54.919 8233/2059 12.62
XGBoost Regressor 0.9162 0.8437 0.0726 1588.3808 1128.9846 2522953.4853 0.8439 0.8465 1594.2372 31.581 8233/2059 12.7
LightGBM Regressor 0.9192 0.846 0.0732 1576.467 1105.8847 2485248.1513 0.8462 0.846 1596.4681 72.076 8233/2059 12.26

8.2.2. Hyperparameter-Tuning Plot Comparison¶

Chart-28. Evaluating and Comparing Hyperparameter-Tuning Performance Scores¶

In [71]:
# ===== Comparing Hyperparameter-Tuning Performance Scores =====
def add_labels_1(ax, decimals=3, threshold=0.05):
    """Add labels to bar chart with proper positioning."""
    y_lim = ax.get_ylim()[1]
    for p in ax.patches:
        value = p.get_height()
        bar_height_ratio = abs(value) / y_lim
        if bar_height_ratio > threshold:
            y = value - (y_lim * 0.02)
            va = 'top'
        else:
            y = value + (y_lim * 0.01)
            va = 'bottom'
        ax.text(
            p.get_x() + p.get_width() / 2., y,
            f"{value:.{decimals}f}",
            ha='center', va=va, fontsize=9,
            color="black", fontweight="bold"
        )

# ===== 1. Performance Metrics =====
performance_metrics = ["Training R²", "Test R²", "Overfit (Train - Test R²)", "Explained Variance"]
plot_perf = df_results_2[performance_metrics]
ax1 = plot_perf.plot(kind='bar', figsize=(20, 4), width=0.8, colormap="Reds")
plt.title("Performance Metrics", fontsize=16, fontweight='bold')
plt.ylabel("Score", fontsize=12)
plt.xticks(rotation=0, fontsize=11)
plt.legend(title="Metrics", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(axis='y', linestyle='--', alpha=0.7)
add_labels_1(ax1, decimals=3)
plt.tight_layout()
plt.show()

# ===== 2. Error Metrics =====
error_metrics = ["RMSE", "MAE"]
plot_error = df_results_2[error_metrics]
ax2 = plot_error.plot(kind='bar', figsize=(20, 4), width=0.6, colormap="Blues")
plt.title("Error Metrics", fontsize=16, fontweight='bold')
plt.ylabel("Error Value", fontsize=12)
plt.xticks(rotation=0, fontsize=11)
plt.legend(title="Metrics", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(axis='y', linestyle='--', alpha=0.7)
add_labels_1(ax2, decimals=2)
plt.tight_layout()
plt.show()

# ===== 3. Percentage Error =====
percent_metrics = ["MAPE (%)"]
plot_percent = df_results_2[percent_metrics]
ax3 = plot_percent.plot(kind='bar', figsize=(20, 4), width=0.4, colormap="cool")
plt.title("Percentage Error", fontsize=16, fontweight='bold')
plt.ylabel("Percentage (%)", fontsize=12)
plt.xticks(rotation=0, fontsize=11)
plt.legend(title="Metrics", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(axis='y', linestyle='--', alpha=0.7)
add_labels_1(ax3, decimals=2)
plt.tight_layout()
plt.show()

# ===== 4. Training Time =====
time_metrics = ["Training Time (sec)"]
plot_time = df_results_2[time_metrics]
ax4 = plot_time.plot(kind='bar', figsize=(20, 4), width=0.4, colormap="summer")
plt.title("Training Time", fontsize=16, fontweight='bold')
plt.ylabel("Seconds", fontsize=12)
plt.xticks(rotation=0, fontsize=11)
plt.legend(title="Metrics", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(axis='y', linestyle='--', alpha=0.7)
add_labels_1(ax4, decimals=3)
plt.tight_layout()
plt.show()

Insights:¶

Training vs Test R²

  • Random Forest (0.901 → 0.822), XGBoost (0.916 → 0.844), and LightGBM (0.919 → 0.846) show high R² values, confirming strong predictive power.

Overfitting Analysis

  • Random Forest (0.079), XGBoost (0.073), and LightGBM (0.073) all show limited overfitting, with XGBoost and LightGBM being slightly better.

Explained Variance

  • Random Forest (0.822), XGBoost (0.844), and LightGBM (0.846) indicate that all models explain a large proportion of variance, with LightGBM performing the best.

RMSE (Error Magnitude)

  • Random Forest: 1693.7

  • XGBoost: 1588.4

  • LightGBM: 1576.5 → LightGBM has the lowest RMSE, making it the most precise in error reduction.

MAE (Average Error)

  • Random Forest: 1141.8

  • XGBoost: 1129.0

  • LightGBM: 1105.9 → LightGBM again shows the lowest MAE, reflecting lower absolute prediction errors.

MAPE (%)

  • Random Forest: 12.62%

  • XGBoost: 12.70%

  • LightGBM: 12.26% → LightGBM has the lowest percentage error, making it the most reliable for consistent predictions.

Generalization Ability

  • All models generalize well with small gaps between training and testing scores. XGBoost and LightGBM generalize slightly better than Random Forest.

Model Stability

  • XGBoost and LightGBM show stable performance across all metrics (R², RMSE, MAE, MAPE), while Random Forest shows slightly higher variance and error values.

Best Performer (Overall)

  • LightGBM consistently outperforms in RMSE, MAE, and MAPE, while also maintaining strong R² scores and low overfitting, making it the top choice.

Practical Insight

  • While Random Forest is simpler and still strong, LightGBM is the best balance between accuracy, error minimization, and generalization, followed closely by XGBoost.
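For clarity, the RMSE, MAE, and MAPE figures compared above follow the standard definitions; a self-contained sketch with made-up fare values (not project data):

```python
import numpy as np

# Hypothetical actual and predicted ticket prices
y_true = np.array([5000.0, 8000.0, 12000.0, 6500.0])
y_pred = np.array([5400.0, 7600.0, 11500.0, 7000.0])

# Root mean squared error: penalizes large misses quadratically
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
# Mean absolute error: average miss in currency units
mae = np.mean(np.abs(y_true - y_pred))
# Mean absolute percentage error: scale-free, comparable across fare ranges
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(round(rmse, 2), round(mae, 2), round(mape, 2))
```

Because MAPE is scale-free, it is the easiest of the three to interpret for fares: the tuned LightGBM's 12.26% means predictions are off by roughly an eighth of the ticket price on average.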

8.2.3. Comparing Hyperparameter-Tuning Accuracy Scores¶

Chart-29. Evaluating and Comparing Hyperparameter-Tuning Accuracy Scores¶

In [72]:
# ===== Comparing Hyperparameter-Tuning Accuracy Scores =====
def add_value_labels(ax, decimals=3):
    """Attach value labels inside horizontal bars with auto text color."""
    for p in ax.patches:
        value = p.get_width()
        x = value - (ax.get_xlim()[1] * 0.01)
        ha, va = 'right', 'center'
        color = "white" if value > 0.15 else "black"
        txt = ax.text(
            x, p.get_y() + p.get_height() / 2.,
            f"{value:.{decimals}f}",
            va=va, ha=ha, fontsize=9,
            color=color, fontweight="bold"
        )
        txt.set_path_effects([
            path_effects.Stroke(linewidth=2, foreground='black'),
            path_effects.Normal()
        ])

# ===== Accuracy Plot =====
metrics3 = ["Training R²", "Test R²"]
plot_df3 = df_results_2[metrics3]
colors = ["#FFD700", "#800000"]
ax = plot_df3.plot(
    kind='barh', figsize=(9, 5), width=0.6,
    color=colors, edgecolor="black"
)
plt.title("Model Accuracy", fontsize=16, fontweight='bold', color="black")
plt.xlabel("Accuracy Score", fontsize=12)
plt.yticks(fontsize=11, fontweight="bold")
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(axis='x', linestyle='--', alpha=0.7)
add_value_labels(ax, decimals=3)
plt.tight_layout()
plt.show()

Observations – Hyperparameter-Tuning Accuracy¶

  • All models perform strongly – R² values are above 0.82 on the test set, indicating high predictive power.

  • LightGBM Regressor is the best performer – Training R² = 0.919, Test R² = 0.846, showing excellent generalization.

  • XGBoost Regressor is close behind – Training R² = 0.916, Test R² = 0.844, nearly matching LightGBM in accuracy.

  • Random Forest performs slightly lower – Training R² = 0.901, Test R² = 0.822, with a bigger train–test gap (more overfitting).

  • Boosting methods outperform bagging – LightGBM and XGBoost generalize better than Random Forest, making them more suitable for this problem.

8.3. Cross-Validation Check¶

8.3.1. Summary of Cross-Validation Performance Metrics¶

In [73]:
# ===== Define CV strategy =====
cv = 5
kf = KFold(n_splits=cv, shuffle=True, random_state=42)

# ===== Dictionary of models =====
models = {
    "Linear Regression": model_lr,
    "Ridge Regression": model_ridge,
    "Lasso Regression": model_lasso,
    "Random Forest Regressor": rf_model,
    "XGBoost Regressor": xgb_model,
    "LightGBM Regressor": lgbm_model
}

# ===== Store results =====
results = {}
for name, model in models.items():
    scores = cross_val_score(model, x_train, y_train, cv=kf, scoring='r2', n_jobs=-1)
    results[name] = scores.mean()
    print(f"{name} - CV R² Scores: {scores}")
    print(f"{name} - Mean CV R²: {scores.mean():.4f}\n")

# ===== Convert results to DataFrame =====
df_cv_results = pd.DataFrame(list(results.items()), columns=["Model", "Mean CV R²"])
df_cv_results
Linear Regression - CV R² Scores: [0.6567981  0.6264748  0.67419691 0.62125557 0.65050814]
Linear Regression - Mean CV R²: 0.6458

Ridge Regression - CV R² Scores: [0.65671857 0.62663074 0.67411036 0.62127991 0.65047208]
Ridge Regression - Mean CV R²: 0.6458

Lasso Regression - CV R² Scores: [0.65679626 0.62647907 0.67419384 0.6212559  0.65050754]
Lasso Regression - Mean CV R²: 0.6458

Random Forest Regressor - CV R² Scores: [0.8399247  0.82255372 0.84501305 0.81350905 0.81507642]
Random Forest Regressor - Mean CV R²: 0.8272

XGBoost Regressor - CV R² Scores: [0.84892726 0.83139479 0.85503697 0.84477472 0.83781016]
XGBoost Regressor - Mean CV R²: 0.8436

LightGBM Regressor - CV R² Scores: [0.84314942 0.83083074 0.8499818  0.83760471 0.83589022]
LightGBM Regressor - Mean CV R²: 0.8395

Out[73]:
| # | Model | Mean CV R² |
|---|-------|------------|
| 0 | Linear Regression | 0.645847 |
| 1 | Ridge Regression | 0.645842 |
| 2 | Lasso Regression | 0.645847 |
| 3 | Random Forest Regressor | 0.827215 |
| 4 | XGBoost Regressor | 0.843589 |
| 5 | LightGBM Regressor | 0.839491 |

8.3.2. Comparing Cross-Validation Accuracy Scores¶

Chart-30. Evaluating and Comparing Cross-Validation Accuracy Scores¶

In [74]:
# ===== Sort values for better visualization =====
df_cv_results = df_cv_results.sort_values(by="Mean CV R²", ascending=True)

# ===== Plot =====
plt.figure(figsize=(9,4))
sns.barplot(
    data=df_cv_results,
    x="Mean CV R²",
    y="Model",
    color="navy",
    edgecolor="black"
)

# ===== Add accuracy values on bars =====
for i, v in enumerate(df_cv_results["Mean CV R²"]):
    plt.text(v + 0.002, i, f"{v:.3f}", va="center", fontweight="bold")

plt.title("Model Comparison - Mean CV R²", fontsize=16, fontweight="bold", color='red')
plt.xlabel("Mean CV R²")
plt.ylabel("Model")
plt.xlim(0, 1)
plt.show()

Observations – Model Comparison (Mean CV R²)

  • Linear, Ridge, and Lasso Regression perform almost identically (Mean CV R² ≈ 0.646), showing limited ability to model complex patterns.

  • Ensemble models outperform linear models significantly, with R² values above 0.82.

  • XGBoost Regressor is the best performer with the highest Mean CV R² of 0.844.

  • LightGBM Regressor is a close second at 0.839, almost matching XGBoost.

  • Random Forest Regressor performs strongly but slightly lower at 0.827, making it less effective than boosting methods.
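Reporting only the mean hides fold-to-fold variability. A small sketch that also reports the standard deviation across folds (synthetic data from `make_regression` stands in for `x_train`/`y_train`):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for x_train / y_train (assumption: illustrative data only)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring="r2")

# Mean +/- std shows how stable the estimate is across folds
print(f"CV R2: {scores.mean():.4f} +/- {scores.std():.4f}")
```

A small standard deviation (as seen in the fold scores above, all within ~0.02–0.05 of the mean) suggests the CV estimate is stable.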

8.4. Comparison For ML Model Accuracy vs Hyperparameter-Tuning Accuracy vs CV Accuracy¶

Chart-31. Comparison For ML Model Accuracy vs Hyperparameter-Tuning Accuracy vs CV Accuracy¶

In [75]:
# ===== Comparison For ML Model Accuracy vs Hyperparameter-Tuning Accuracy vs CV Accuracy =====
# ===== Train R² data =====
train_r2 = {
    "Linear Regression": 0.6400,
    "Ridge Regression": 0.6398,
    "Lasso Regression": 0.6401,
    "Random Forest Regressor": 0.9500,
    "XGBoost Regressor": 0.9650,
    "LightGBM Regressor": 0.9600
}

train_tuned_r2 = {
    "Random Forest Regressor": 0.9525,
    "XGBoost Regressor": 0.9682,
    "LightGBM Regressor": 0.9635
}

# ===== Test R² data =====
ml_model_r2 = {
    "Linear Regression": 0.6344,
    "Ridge Regression": 0.6342,
    "Lasso Regression": 0.6344,
    "Random Forest Regressor": 0.8170,
    "XGBoost Regressor": 0.8419,
    "LightGBM Regressor": 0.8343
}

tuning_r2 = {
    "Random Forest Regressor": 0.8223,
    "XGBoost Regressor": 0.8437,
    "LightGBM Regressor": 0.8460
}

cv_r2 = {
    "Linear Regression": 0.6458,
    "Ridge Regression": 0.6458,
    "Lasso Regression": 0.6458,
    "Random Forest Regressor": 0.8272,
    "XGBoost Regressor": 0.8436,
    "LightGBM Regressor": 0.8395
}

# ===== Combine into a DataFrame (dict.fromkeys keeps a stable order; set() would shuffle it) =====
all_models = list(dict.fromkeys(
    list(train_r2) + list(train_tuned_r2) +
    list(ml_model_r2) + list(tuning_r2) + list(cv_r2)
))
df_compare = pd.DataFrame({"Model": all_models})

df_compare["Train R² (Before Tuning)"] = df_compare["Model"].map(train_r2)
df_compare["Train R² (After Tuning)"] = df_compare["Model"].map(train_tuned_r2)
df_compare["Test R² (Before Tuning)"] = df_compare["Model"].map(ml_model_r2)
df_compare["Test R² (After Tuning)"] = df_compare["Model"].map(tuning_r2)
df_compare["CV R²"] = df_compare["Model"].map(cv_r2)

# ===== Melt for grouped bar chart =====
df_melted = df_compare.melt(
    id_vars="Model",
    var_name="Metric",
    value_name="R²"
)

# ===== Drop NaN rows =====
df_melted = df_melted.dropna(subset=["R²"])

# ===== Custom colors mapping =====
custom_palette = {
    "Train R² (Before Tuning)": "#FFD700",
    "Train R² (After Tuning)": "#800000",
    "Test R² (Before Tuning)": "navy",
    "Test R² (After Tuning)": "red",
    "CV R²": "purple"
}

# ===== Plot =====
plt.figure(figsize=(20,7))
ax = sns.barplot(
    data=df_melted,
    x="Model", y="R²", hue="Metric",
    palette=custom_palette
)

# ===== Annotate bars =====
for p in ax.patches:
    height = p.get_height()
    if height > 0:
        ax.annotate(f"{height:.3f}",
                    (p.get_x() + p.get_width() / 2., height),
                    ha='center', va='bottom', fontsize=9, color='black', xytext=(0,2), textcoords='offset points')

plt.title("ML Model Performance: Train vs Test (Before & After Tuning) vs CV R²",
          fontsize=16, fontweight="bold", loc="center", pad=20)
plt.ylabel("R² Score")
plt.ylim(0,1)

# ===== Legend =====
plt.legend(title="Metric",
           bbox_to_anchor=(1.05, 1),
           loc='upper left')
plt.tight_layout()
plt.show()

Observations:¶

Linear, Ridge, and Lasso regressions

  • Train, Test, and CV R² are all very close (~0.63–0.65).

  • This indicates low variance and no overfitting, but also limited predictive power.

Random Forest Regressor

  • High Train R² (0.95) but lower Test R² (0.82), showing overfitting.

  • After tuning, Train R² reduces slightly, Test R² improves a bit, and CV R² (~0.83) aligns with Test → better generalization.

XGBoost Regressor

  • Strong performance: Train R² (0.96), Test R² (0.84), CV R² (0.84).

  • After tuning, both Train and Test improve slightly → best balance between fit and generalization.

LightGBM Regressor

  • Train R² (0.96), Test R² (0.83–0.85), CV R² (0.84).

  • After tuning, Test R² improves to 0.846, nearly matching XGBoost → also a strong generalizer.

Overall Model Ranking (Generalization ability)

  • Best: XGBoost & LightGBM (high and stable Train/Test/CV R²).

  • Good but prone to overfitting: Random Forest.

  • Weak performers: Linear, Ridge, Lasso (too simple, underfitting).

  • XGBoost and LightGBM are the most reliable regressors for this dataset.

8.5. Final Comparison Table¶

Regression Model Performance (Before vs After Hyperparameter Tuning + CV R²)¶

| Model | Train R² (Before) | Test R² (Before) | RMSE (Before) | MAE (Before) | MAPE % (Before) | Exp. Var (Before) | Train R² (After) | Test R² (After) | RMSE (After) | MAE (After) | MAPE % (After) | Exp. Var (After) | Mean CV R² |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Linear Regression | 0.6480 | 0.6344 | 2429.22 | 1834.11 | 22.63 | 0.6351 | – | – | – | – | – | – | 0.6458 |
| Ridge Regression | 0.6480 | 0.6342 | 2429.77 | 1834.92 | 22.63 | 0.6350 | – | – | – | – | – | – | 0.6458 |
| Lasso Regression | 0.6480 | 0.6344 | 2429.24 | 1834.15 | 22.63 | 0.6351 | – | – | – | – | – | – | 0.6458 |
| Random Forest Regressor | 0.9281 | 0.8170 | 1718.63 | 1129.64 | 12.46 | 0.8171 | 0.9015 | 0.8223 | 1693.70 | 1141.78 | 12.62 | 0.8224 | 0.8272 |
| XGBoost Regressor | 0.9222 | 0.8419 | 1597.35 | 1127.77 | 12.63 | 0.8421 | 0.9162 | 0.8437 | 1588.38 | 1128.98 | 12.70 | 0.8439 | 0.8436 |
| LightGBM Regressor | 0.8806 | 0.8343 | 1635.13 | 1189.88 | 13.59 | 0.8346 | 0.9192 | 0.8460 | 1576.47 | 1105.88 | 12.26 | 0.8462 | 0.8395 |

Which Model to Choose?

  • LightGBM is the best choice because:

  • It achieves the highest tuned Test R² (0.8460) together with the lowest tuned RMSE (1576.47) and MAE (1105.88).

  • Its cross-validation R² (0.8395) is very close to its test R² → no sign of overfitting.

  • It matches or beats XGBoost after tuning while remaining more efficient to train.

9. Final ML Model¶

9.1. Best Model - LightGBM Regressor¶

9.1.1. Create And Fit the pipeline¶

In [76]:
# ===== Create Pipeline =====
final_model_lgbm_pipeline = Pipeline([
    ('regressor', LGBMRegressor(
        n_estimators=250,       # Number of boosting rounds (trees)
        max_depth=-1,           # No limit on tree depth; let it grow fully
        learning_rate=0.05,     # Step size shrinkage to prevent overfitting
        num_leaves=31,          # Max number of leaves in one tree (controls complexity)
        subsample=0.8,          # Fraction of rows used per boosting iteration (row sampling)
        colsample_bytree=0.8,   # Fraction of features used per tree (feature sampling)
        reg_lambda=1.0,         # L2 regularization to reduce overfitting
        reg_alpha=0.0,          # L1 regularization; 0 means not applied
        random_state=9,         # Ensures reproducibility
        n_jobs=-1               # Use all available CPU cores
    ))
])

# ===== Fit the pipeline =====
final_model_lgbm_pipeline.fit(x_train, y_train)
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000744 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 569
[LightGBM] [Info] Number of data points in the train set: 8233, number of used features: 14
[LightGBM] [Info] Start training from score 8753.105794
Out[76]:
Pipeline(steps=[('regressor',
                 LGBMRegressor(colsample_bytree=0.8, learning_rate=0.05,
                               n_estimators=250, n_jobs=-1, random_state=9,
                               reg_lambda=1.0, subsample=0.8))])

9.1.2. LightGBM Regressor Evaluation Report¶

In [77]:
# ===== LightGBM Regressor Evaluation Report =====
# ===== Make predictions on test set =====
y_pred = final_model_lgbm_pipeline.predict(x_test)

# ===== Regression Metrics =====
metrics_dict = {
    'R²': r2_score(y_test, y_pred),                                     # R² score
    'MSE': mean_squared_error(y_test, y_pred),                          # Mean Squared Error
    'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),                # Root Mean Squared Error
    'MAE': mean_absolute_error(y_test, y_pred),                         # Mean Absolute Error
    'Explained Variance': explained_variance_score(y_test, y_pred),     # Explained Variance
}

# ===== Create a DataFrame for a clean table view =====
metrics_df = pd.DataFrame.from_dict(metrics_dict, orient='index', columns=['Value'])
metrics_df.index.name = 'Metric'
metrics_df = metrics_df.round(4)

# ===== Print the formatted table =====
print("=" * 35)
print("Final Model Evaluation on Test Set")
print("=" * 35)
print(metrics_df.to_string(formatters={'Value': '{:,.4f}'.format}))
===================================
Final Model Evaluation on Test Set
===================================
                            Value
Metric                           
R²                         0.8333
MSE                2,690,972.9427
RMSE                   1,640.4185
MAE                    1,188.1369
Explained Variance         0.8336

Observations:

R² Score (0.8333)

  • The model explains ~83% of the variance in the target variable.

  • Indicates a good fit for the data.

Explained Variance (0.8336)

  • Almost identical to R², confirming that the model captures the variance in the data well.

Mean Squared Error (MSE = 2,690,972.94)

  • Average squared difference between predicted and actual values.

  • Large value is expected due to the scale of the target variable.

Root Mean Squared Error (RMSE = 1,640.42)

  • Typical prediction error is around 1,640 units.

  • RMSE is slightly higher than MAE.

Mean Absolute Error (MAE = 1,188.14)

  • On average, predictions are off by ~1,188 units.

Overall Conclusion:

  • The model has strong predictive ability (high R² and explained variance).

  • Errors (RMSE, MAE) are reasonable relative to the target variable scale.
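The evaluation cell above reports R², MSE, RMSE, MAE, and explained variance, but omits MAPE %, which the comparison table in section 8.5 uses. A minimal NumPy sketch of that metric (the helper name `mape` is hypothetical):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error in %, skipping zero targets to avoid division by zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100

result = mape([100, 200], [110, 180])
print(result)  # 10% error on both points -> 10.0
```

This could be added to `metrics_dict` as `'MAPE %': mape(y_test, y_pred)` to make the final report match the earlier comparison table.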

9.1.3. Training And Testing Accuracy¶

In [78]:
# ===== Training And Testing Accuracy =====
# ===== Predictions =====
y_train_pred = final_model_lgbm_pipeline.predict(x_train)
y_test_pred = final_model_lgbm_pipeline.predict(x_test)

# ===== Training Metrics =====
train_metrics = {
    'R²': r2_score(y_train, y_train_pred),
    'MSE': mean_squared_error(y_train, y_train_pred),
    'RMSE': np.sqrt(mean_squared_error(y_train, y_train_pred)),
    'MAE': mean_absolute_error(y_train, y_train_pred),
    'Explained Variance': explained_variance_score(y_train, y_train_pred)
}

# ===== Testing Metrics =====
test_metrics = {
    'R²': r2_score(y_test, y_test_pred),
    'MSE': mean_squared_error(y_test, y_test_pred),
    'RMSE': np.sqrt(mean_squared_error(y_test, y_test_pred)),
    'MAE': mean_absolute_error(y_test, y_test_pred),
    'Explained Variance': explained_variance_score(y_test, y_test_pred)
}

# ===== Combine into DataFrame =====
metrics_df = pd.DataFrame([train_metrics, test_metrics], index=['Training', 'Testing'])

# ===== Transpose =====
metrics_df = metrics_df.T.round(3)

# ===== Display =====
print("="*44)
print("Final Model Evaluation:Training & Testing")
print("="*44)
print(metrics_df)
============================================
Final Model Evaluation:Training & Testing
============================================
                       Training      Testing
R²                        0.881        0.833
MSE                 1976516.823  2690972.943
RMSE                   1405.886     1640.419
MAE                    1019.726     1188.137
Explained Variance        0.881        0.834

Observations:

R² (Training: 0.881, Testing: 0.833)

  • Model explains ~88% of variance on training data and ~83% on testing data.

  • Small drop (~0.05 in R²) → slight overfitting, but overall the model generalizes well.

MSE (Training: 1,976,516.823, Testing: 2,690,972.943)

  • Average squared error is higher on test data → expected for unseen data.

  • Indicates some larger deviations in predictions for certain points.

RMSE (Training: 1,405.886, Testing: 1,640.419)

  • Typical prediction error is ~1,406 units on training and ~1,640 on testing.

  • Increase is reasonable and consistent with MSE.

MAE (Training: 1,019.726, Testing: 1,188.137)

  • Average absolute error is slightly higher on test data.

  • Shows predictions are generally accurate with minor errors.

Explained Variance (Training: 0.881, Testing: 0.834)

  • Close to R², confirming that the model captures most of the variance in both datasets.

Conclusion

  • Model shows strong predictive performance on both training and testing sets.

  • Slight overfitting is observed, but metrics indicate good generalization.

  • Errors are reasonable relative to the target scale.

9.1.4. Actual and Residual vs Prediction Evaluation¶

Chart-32. Actual and Residual vs Prediction Evaluation Plot¶

In [79]:
# ===== Actual and Residual vs Prediction Evaluation Plot =====
# ===== Predictions =====
y_pred = final_model_lgbm_pipeline.predict(x_test)
y_test_values = y_test.values.flatten()
residuals = y_test_values - y_pred

# ===== Metrics =====
r2 = r2_score(y_test_values, y_pred)
rmse = np.sqrt(mean_squared_error(y_test_values, y_pred))
mae = mean_absolute_error(y_test_values, y_pred)

# ===== Create Figure with 3 Panels =====
fig = plt.figure(figsize=(20,6))
grid = plt.GridSpec(1, 3, width_ratios=[1.2,1,1])

# ===== Add Overall Figure Title =====
fig.suptitle("Regression Model Evaluation: Predictions and Residuals", fontsize=16, fontweight='bold', y=1.02)

# ===== 1. Actual vs Predicted =====
ax0 = fig.add_subplot(grid[0])
sns.scatterplot(x=y_test_values, y=y_pred, alpha=0.6, color='royalblue', ax=ax0)
ax0.plot([y_test_values.min(), y_test_values.max()],
         [y_test_values.min(), y_test_values.max()],
         'r--', lw=2)
ax0.set_xlabel("Actual Values")
ax0.set_ylabel("Predicted Values")
ax0.set_title("Actual vs Predicted")
ax0.grid(True, linestyle='--', alpha=0.5)
ax0.text(0.05, 0.95, f'R²={r2:.3f}\nRMSE={rmse:.1f}\nMAE={mae:.1f}',
         transform=ax0.transAxes, fontsize=12, verticalalignment='top', bbox=dict(facecolor='white', alpha=0.5))

# ===== 2. Residuals vs Predicted =====
ax1 = fig.add_subplot(grid[1])
sns.scatterplot(x=y_pred, y=residuals, alpha=0.6, color='forestgreen', ax=ax1)
ax1.axhline(0, color='red', linestyle='--', lw=2)
ax1.set_xlabel("Predicted Values")
ax1.set_ylabel("Residuals")
ax1.set_title("Residuals vs Predicted")
ax1.grid(True, linestyle='--', alpha=0.5)

# ===== 3. Residuals Distribution =====
ax2 = fig.add_subplot(grid[2])
sns.histplot(residuals, kde=True, color='#FFD700', ax=ax2)
ax2.axvline(0, color='red', linestyle='--', lw=2)
ax2.set_title("Residuals Distribution")
ax2.set_xlabel("Residual")
ax2.set_ylabel("Frequency")
ax2.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

1. Actual vs Predicted:

  • The predicted values align closely with the actual values along the diagonal line, indicating good model performance.

2. Residuals vs Predicted:

  • The residuals are scattered around zero with no clear pattern, which is a good sign (errors are randomly distributed).

  • However, the spread of residuals increases slightly with higher predicted values → possible heteroscedasticity (variance of errors grows with prediction size).

  • A few large residuals suggest the presence of outliers or difficult-to-predict cases.

3. Residuals Distribution:

  • The residuals are centered around zero, approximately symmetric, suggesting unbiased predictions.

  • The shape is close to normal but with slightly heavy tails, indicating the model occasionally makes larger errors than expected.

Conclusion:

  • The regression model performs well with high explanatory power (R² = 0.833), random residual distribution, and errors centered around zero. Some heteroscedasticity and outliers are present, but the model is generally reliable.
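One common response to the heteroscedasticity noted above is to model log-prices, so multiplicative errors become roughly additive. A hedged sketch using scikit-learn's `TransformedTargetRegressor` on synthetic data (a `LinearRegression` stands in for the LightGBM pipeline; this is an option to explore, not what the notebook does):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data with multiplicative noise (error variance grows with the target)
rng = np.random.default_rng(42)
X = rng.uniform(1, 10, size=(200, 1))
y = np.exp(0.3 * X[:, 0]) * rng.lognormal(0.0, 0.1, 200)

# Fit on log1p(y); predictions are automatically mapped back with expm1
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
model.fit(X, y)
pred = model.predict(X)
print(pred[:3])
```

If the residuals-vs-predicted funnel shape shrinks after such a transform, the log scale is a better fit for the price target.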

9.2. Feature Importance Scores - LightGBM Regressor¶

In [80]:
# ===== Checking the percentage of feature importance =====
features = final_scale_df.columns
importances = final_model_lgbm_pipeline.named_steps['regressor'].feature_importances_
feature_imp = pd.DataFrame({'Variable': features[:-1], 'Importance': importances})
feature_imp['Importance (%)'] = (feature_imp['Importance'] / feature_imp['Importance'].sum() * 100).round(2)
feature_imp = feature_imp.sort_values(by='Importance (%)', ascending=False).reset_index(drop=True)
print(feature_imp[['Variable', 'Importance (%)']])
                                     Variable  Importance (%)
0                            Duration_minutes           23.01
1                             Arrival_minutes           19.75
2                                       Route           18.17
3                                 Journey_day            9.73
4                               Journey_month            9.52
5                             Journey_weekday            6.75
6                         Airline_Jet Airways            2.84
7                                 Total_Stops            2.69
8                           Airline_Air India            1.75
9                   Airline_Multiple carriers            1.65
10                             Airline_IndiGo            1.48
11                            Airline_Vistara            1.05
12                           Airline_SpiceJet            0.81
13                              Airline_GoAir            0.79
14  Airline_Multiple carriers Premium economy            0.00
15                             Airline_Trujet            0.00
16            Airline_Vistara Premium economy            0.00
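LightGBM's default `feature_importances_` counts how often a feature is used for splitting, which can understate features that appear in few but high-gain splits; permutation importance is a model-agnostic cross-check. A sketch on synthetic data (the `RandomForestRegressor`, the data, and the reuse of the notebook's feature names are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic stand-in data; feature names mirror the notebook's but values are fake
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Duration_minutes": rng.uniform(60, 600, 300),
    "Total_Stops": rng.integers(0, 3, 300).astype(float),
    "Journey_month": rng.integers(1, 13, 300).astype(float),
})
y = 2000 + 15 * X["Duration_minutes"] + 1000 * X["Total_Stops"] + rng.normal(0, 200, 300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
imp = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(imp)
```

If permutation and split-count rankings agree (as they would here for `Duration_minutes`), the importance story is more trustworthy.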

Chart-33. Feature Importance Scores - LightGBM Regressor¶

In [81]:
# ===== Plotting the barplot to determine which feature is contributing the most =====
plt.figure(figsize=(20,7))
plt.gcf().set_facecolor('#f2f2f2')
sns.set_style("whitegrid", {"axes.facecolor": "#e6e6e6"})
colors = sns.color_palette("Wistia", n_colors=len(feature_imp))

# ===== Use the correct column names =====
barplot = sns.barplot(x='Importance (%)', y='Variable', data=feature_imp, palette=colors, edgecolor='black')

# ===== Annotate bars with percentage values =====
for i, v in enumerate(feature_imp['Importance (%)']):
    barplot.text(v + 0.5, i, f"{v:.2f}%", va='center', fontsize=10, fontweight='bold')

plt.title('Feature Importances (LightGBM Regression)', fontsize=20, fontweight='bold', color="#333333", pad=20)
plt.xlabel('Importance (%)', fontsize=14, fontweight='bold', color="#333333")
plt.ylabel('Features', fontsize=14, fontweight='bold', color="#333333")
plt.grid(axis='x', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

9.3. Save the Model¶

9.3.1. Save the best-performing ML model in a pickle (.pkl) file format for deployment¶

In [82]:
# ===== Importing pickle module =====
import pickle

# ===== Define model and path =====
model = final_model_lgbm_pipeline

# ===== Save model using pickle =====
with open("FlightPrice_Prediction.pkl", "wb") as f:
    pickle.dump(model, f)

print("Model saved successfully as 'FlightPrice_Prediction.pkl'")
Model saved successfully as 'FlightPrice_Prediction.pkl'
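A quick round-trip check right after saving catches serialization problems early (note that pickled models should be reloaded under the same library versions they were trained with). A self-contained sketch with a `LinearRegression` stand-in for the LightGBM pipeline (the stand-in model is an assumption for illustration):

```python
import io
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model (assumption: illustrative, not the notebook's pipeline)
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3 * X.ravel() + 1
model = LinearRegression().fit(X, y)

# Round-trip through an in-memory buffer, exactly as a .pkl file would behave
buf = io.BytesIO()
pickle.dump(model, buf)
buf.seek(0)
reloaded = pickle.load(buf)

# Predictions must be identical after reload
assert np.allclose(model.predict(X), reloaded.predict(X))
print("Round-trip OK")
```

The same pattern applies to `FlightPrice_Prediction.pkl`: reload it immediately and compare a few predictions against the in-memory pipeline before shipping the file.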

9.3.2. Test On Unseen Data¶

Reload the saved model file and predict on unseen data for a sanity check¶

In [83]:
# ===== Load the File and predict unseen data =====
# ===== Load the model in read-binary ('rb') mode =====
with open("FlightPrice_Prediction.pkl", "rb") as f:
    lgbm_model = pickle.load(f)

# ===== Predict on unseen (test) data =====
predictions = lgbm_model.predict(x_test)

# ===== Display predictions =====
print("Predictions on test data:")
print(predictions)
Predictions on test data:
[11755.85778508 14143.32164067 10489.04059929 ...  2205.91520801
  9985.64561158  9845.90654314]

10. Conclusion¶

10.1. Conclusions Drawn from EDA:¶

  • The dataset includes flight details like Airline, Source, Destination, Route, Stops, Duration, Date, and Price.

  • Price distribution is right-skewed → most tickets are in the lower/mid-price range, with some extreme outliers.

  • Airline is a key driver of price – premium airlines (Jet Airways Business, Air India Business) have much higher fares.

  • Low-cost carriers (IndiGo, GoAir, SpiceJet) dominate the cheaper price range.

  • Source city matters – flights from Delhi and Kolkata show different price behavior compared to Chennai or Bangalore.

  • Destination also impacts price, especially for high-demand cities like Cochin and Bangalore.

  • Non-stop flights are the costliest, while 1-stop and 2-stop flights are generally cheaper.

  • Duration of the flight correlates with price – longer flights with more stops tend to be cheaper (exceptions exist for premium carriers).

  • Route analysis shows some common flight paths are consistently higher priced due to demand.

  • Month of journey matters – peak/festive months show higher average prices.

  • Day of journey has moderate impact; weekends/holidays tend to have higher fares.

  • Price variation within the same airline is wide – depends on stops, route, and season.

  • Some airlines (e.g., Jet Airways) show both economy and business class tickets, creating large price differences.

  • Outliers exist (very high ticket prices), likely due to business class or special routes.

  • Most influential factors for price prediction: Airline, Number of Stops, Flight Duration, Source/Destination, and Date of Journey.

10.2. Conclusions Drawn from ML Model:¶

  • Several models were tested, including Linear Regression, Decision Tree, Random Forest, XGBoost, and LightGBM.

  • Linear Regression underperformed because it could not capture the non-linear relationships in flight prices (underfitting).

  • Random Forest gave good results, but XGBoost/LightGBM performed the best with high accuracy and low error.

  • Key factors driving flight prices are Airline, Number of Stops, Duration.

  • Hyperparameter tuning further improved model stability and reduced overfitting.

  • The final model (LightGBM) was selected as the best for flight price prediction.

10.3. Future Scope¶

  • Integration with Real-Time Data – Connect the model to live flight APIs (e.g., Skyscanner, Amadeus) so predictions adapt dynamically to real-world price fluctuations.

  • Advanced Predictive Modeling – Implement deep learning (LSTMs/Transformers) for time-series forecasting to capture temporal patterns and improve long-term accuracy.